ABSTRACT


For statistical classification or discrimination among two or more
classes, the present study discusses the available parametric and
nonparametric classification procedures and their associated
probabilities of correct classification. All procedures are derived from
a single statistical perspective, namely by maximizing rather obvious
estimators of the probability of correct classification. The study
includes some general remarks on these classification procedures.
CHAPTER I


Introduction 


The object of this study is to examine and present certain aspects of
the classification problem, including various classification procedures
discussed in the literature. The problem of Discrimination, also known
as the Identification problem, concerns itself with correctly allocating
an individual to one of a specified number k of populations. The
Classification problem, on the other hand, is concerned with classifying
a sample of individuals into groups which are to be distinct in some
sense. These two problems are virtually the same in nature. Accordingly,
in the sequel we shall use the terms discrimination / allocation /
identification / assignment as synonyms in reference to the same
classification problem.

In principle, the classification problem is one of the simplest in
statistics; in practice, however, it has a large number of snags,
largely because the assumed theoretical model does not always reflect
the practical situation sufficiently closely. The problem was considered
to be of practical importance as early as 1935. Classification has
applications in medical diagnosis and treatment, drug interaction
studies, neurobiological signal processing, sonar detection, etc.
Clinical data, such as electrocardiograms and electroencephalograms, can
also be analysed and classified using classification techniques. Besides
medical problems, other familiar instances where such a problem arises
are:




(i) When an anthropologist faces the problem of sexing a skull or
jawbone;

(ii) When a taxonomist is assigned the problem of classifying an
organism into species or subspecies;

(iii) When the authorship of a disputed article is to be determined;
etc.

Among the well-known classification procedures are Fisher's linear
discriminant function, introduced by Fisher [1936], and Anderson's
classification statistic, introduced by Anderson [1951]. It was Welch
[1939] who gave the first mathematical formulation, on the basis of the
foundations laid by Neyman and Pearson in the theory of testing of
hypotheses. Subsequent authors made many refinements, giving different
classification statistics. Rao [1969] considers an extended formulation
of the classification problem that recognises the possibility of an
individual belonging to an unspecified population, as for example when a
biologist discovers a member of a new species. In this connection,
Srivastava [1973] proposed the "step-down" procedure for classification
into one of two multivariate normal populations. Relatively little has
been done in the area of multiple group discrimination. Only recently,
Lachenbruch [1973] proposed two methods for classification into one of
several populations and studied their relative performance. The
estimation of probabilities of misclassification has been studied in
detail by Dunn and Varady [1966] and Hills [1966]. In this connection,
the papers by Glick [1972] and Lachenbruch and Mickey [1968], among
others, should be mentioned.
In Chapter II, we give a detailed account of the major available
parametric classification procedures. Section 2.2 deals mainly with
rules of classification into known distributions, including the
well-known Fisher's linear discriminant function rule and Mahalanobis'
generalized squared distance rule. Sample-based classification rules are
dealt with in Section 2.3. These arise when the distributions are not
specified completely and information on them is to be obtained from
samples. The chapter includes expressions for the optimal probability of
correct classification. A review of the literature dealing with these
classification rules and the associated probabilities of
misclassification is also given.

Chapter III deals with the non-parametric classification problem. The
required estimation of probability density functions in such problems is
discussed in detail in Section 3.2. The problem of density estimation
has received attention only recently in the literature. Fixed window
density estimates were suggested by Parzen [1962] and Cacoullos [1966].
The section also includes Loftsgaarden and Quesenberry's [1965] fixed
view density estimation method. In Section 3.3, different non-parametric
classification procedures available in the literature are discussed.
These rules include the nearest neighbor rule suggested by Fix and
Hodges [1951], the minimum distance classification rule suggested by
Das Gupta [1964], the best-count rule proposed by Glick [1969] and a few
others.

Chapter IV deals mainly with the mathematical proofs of various
assertions made in Chapter II and Chapter III on parametric and
non-parametric classification procedures and the associated probability
of correct classification.

Finally, in Chapter V, we make some general remarks on classification
theory which may be of importance in applications and further research
work.


(For  computational  examples,  see  Appendix  I.) 


CHAPTER  II 


Parametric  Classification 


In  this  chapter,  we  introduce  some  major  parametric  rules  of 
classification  into  known  distributions  and  sample-based  classification 
rules.  All  these  rules  assume  the  existence  of  underlying  densities, 
with  parameters  known  or  unknown.  In  the  case  of  unknown  parameters, 
simple  estimates  of  parameters  prove  helpful  for  the  construction  of 
classification  procedures.  We  also  study  in  brief  the  probabilities 
of  correct  classification  discussed  in  the  literature. 

§2.1  Main  Formulations  of  the  Problem. 

(i) Let $\pi_1, \pi_2, \ldots, \pi_k$ be k distinct populations
(groups/categories/classes). Given a random sample from an unknown
population $\pi_0$, known to be one of $\pi_1, \pi_2, \ldots, \pi_k$,
the problem of classification demands a decision, optimum in some sense,
as to which one of the latter k populations is $\pi_0$. Since a decision
rule is a function from the sample space, $\mathcal{X}$, to the set of
decisions, $\{\pi_1, \pi_2, \ldots, \pi_k\}$, it will be based upon the
observation vector X and the available information about the
distributions of $\pi_i$ (i = 1,2,...,k). If the information is
unspecified or inadequate, supplementary information can be obtained
through random samples from each of the k populations; such samples are
termed "training" samples.

(ii) Suppose there is a population T, consisting of k mutually
exclusive subpopulations $\pi_1, \pi_2, \ldots, \pi_k$ mixed in
respective proportions (a priori probabilities) $q_1, q_2, \ldots, q_k$
($q_i > 0$, $1 \le i \le k$, $\sum_{i=1}^{k} q_i = 1$), known or
unknown. An individual selected at random from T may be regarded as a
random vector $\langle I, X \rangle$, where I denotes the individual's
group and X is the p-dimensional vector of measurements. For the units
to be classified, I is unobservable, but X can be observed.

The problem of classification amounts to making an inference on the
value of I from the knowledge of X. The distribution of I is over the
set {1,2,...,k}. The problem will be termed the "known mixture" or
"unknown mixture" problem according as the distribution of I is known or
unknown.

In  constructing  a  classification  procedure,  it  is  desired  to 
minimize  the  expected  losses  or  the  probabilities  of  misclassifying  an 
individual.  A  procedure  which  achieves  this  minimum  is  called  the 
"best"  or  "optimal"  procedure. 

Remark 2.1: In the preceding formulation, one may consider, more
generally, I as a continuous or discrete variable with a physical
meaning, and the population $\pi_i$ corresponds to $I \in S_i$, where
$S_1, S_2, \ldots, S_k$ is a partition of the I-space. Marshall and
Olkin [1968] include the decision of observing I along with making k
decisions in their formulation.
§2.2 Classification Into Known Distributions.

2.2.1 Bayes Procedure.

Consider the formulation (ii) of section 2.1 with the $q_i$'s
($1 \le i \le k$) known. On the basis of the observed X = x, a decision,
optimum in the sense described in section 2.1, has to be reached about
the membership of the individual in one of k specified populations.

The probabilistic structure may be specified by

$P\{I = i\} = q_i$, $1 \le i \le k$,

$P[X \le x \mid I = i] = F_i(x)$, $1 \le i \le k$, $x \in \mathcal{X}$.

A nonrandomized decision rule D consists of a partition of the sample
space $\mathcal{X}$ into k mutually exclusive regions
$D_1, D_2, \ldots, D_k$, with a rule which assigns an individual with
measurement vector X = x to the ith population if and only if the
observed $x \in D_i$, i = 1,2,...,k. Let $\mathcal{D}^*$ denote the
collection of all classification rules.

Since the number of decisions (classifications) is finite, attention
may be restricted to nonrandomized decision rules. It is well known
that, in a finite decision problem, the optimal solutions for the
randomized and nonrandomized rules are essentially the same. (For the
definition of the randomized rules and the proof of this assertion, see
Rao [1973] section 7d.3.)
Let $f_1, f_2, \ldots, f_k$ denote the probability densities of
$F_1, F_2, \ldots, F_k$ respectively, with respect to a $\sigma$-finite
measure $\mu$. Suppose further that a loss, $C_{ij}$ ($\ge 0$), is
incurred in assigning an individual from the ith population to the jth
population. A loss function which assigns 0 loss to correct
classification, and unit loss to any misclassification, is called a
simple loss function; i.e., for a simple loss function

(2.2.1)    $C_{ij} = \begin{cases} 0 & \text{if } i = j \\ 1 & \text{if } i \ne j \end{cases}$


For a nonrandomized rule, the expected loss in applying a given rule D,
when in fact the individual belongs to the ith population, is

$L_i = \sum_{j=1}^{k} C_{ij} \int_{D_j} f_i(x)\, d\mu(x)$,  i = 1,2,...,k.


Knowing the prior probabilities $q_i$, $1 \le i \le k$, the expected
loss of incorrectly classifying an individual from the mixed population
T, associated with D, is

(2.2.2)    $\rho(D) = \sum_{i=1}^{k} q_i L_i = \sum_{j=1}^{k} \left\{ -\int_{D_j} g_j(x)\, d\mu(x) \right\}$,

where

(2.2.3)    $g_j(x) = -\sum_{i=1}^{k} q_i C_{ij} f_i(x)$

is the so-called jth discriminant score of an individual,
$1 \le j \le k$. Define by $\gamma(x)$ the maximum of the discriminant
scores:

(2.2.4)    $\gamma(x) = \max_{1 \le j \le k} g_j(x)$.

The Bayes rule $D^*$ corresponding to a given a priori distribution
$\{q_1, q_2, \ldots, q_k\}$ always exists and consists of assigning an
individual to that population for which his discriminant score, defined
by (2.2.3), is the highest. (For a proof, see Rao [1973] p. 493 result
(i), or Anderson [1958] section 6.6.) An optimal partition corresponding
to this Bayes rule $D^*$ is expressible as
$D^* = \{D_1^*, D_2^*, \ldots, D_k^*\}$, where

(2.2.5)    $D_j^* = \{x \in \mathcal{X} : g_j(x) = \gamma(x)\}$,  $1 \le j \le k$.

Ties may be resolved arbitrarily; e.g., specify a unique partition by
taking $x \in D_j^*$ if and only if j is the smallest integer for which
the maximum is attained.

As a particular case, consider the problem of classifying an individual
into one of two specified populations; i.e., k = 2. By the preceding
arguments, the classification problem amounts to determining two
regions, $D_1^*$ and $D_2^*$, which minimize the expected loss (2.2.2).
The optimal rule is

(2.2.6)    $D_1^* = \{x \in \mathcal{X} : f_1(x)/f_2(x) > C\}$,
           $D_2^* = \{x \in \mathcal{X} : f_1(x)/f_2(x) < C\}$,

where $C = \dfrac{C_{21}\, q_2}{C_{12}\, q_1}$ depends on the relative
losses of misclassification and the prior probabilities. The case
$f_1(x) = C f_2(x)$ can be resolved in some arbitrary manner, such as
flipping a coin and deciding that the individual comes from $\pi_1$ or
$\pi_2$ according as the coin shows a head or a tail.
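For k = 2 the rule reduces to comparing a likelihood ratio with the cut-off C. A minimal sketch follows, with the hypothetical function names `likelihood_ratio_cutoff` and `classify_two` introduced here for illustration only; the tie is reported as `None` so that the caller can apply whatever arbitrary resolution is preferred.

```python
def likelihood_ratio_cutoff(q1, q2, c12, c21):
    """C = (C_21 * q_2) / (C_12 * q_1), the threshold appearing in (2.2.6)."""
    return (c21 * q2) / (c12 * q1)

def classify_two(f1_x, f2_x, cutoff):
    """Return 1 or 2, or None when f_1(x) = C * f_2(x) (tie left to the caller).

    Comparing f_1(x) with C * f_2(x) avoids dividing by f_2(x) = 0.
    """
    if f1_x > cutoff * f2_x:
        return 1
    if f1_x < cutoff * f2_x:
        return 2
    return None
```

With equal priors and equal losses the cut-off is 1, so the individual is simply assigned to the population with the larger density at x.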


Remark 2.2. (i) Let $\rho^* = \inf_{D \in \mathcal{D}^*} \rho(D)$.
Then the optimal Bayes rule $D^*$ is the one which minimizes $\rho(D)$;
i.e., the optimal Bayes rule $D^*$ satisfies

$\rho(D^*) = \rho^*$.

We call $\rho^*$ the Bayes risk.


(ii) In many practical problems, it is difficult to assess the losses
due to wrong classification. In such cases, the simple loss structure is
assumed, and $L_i$ (see (2.2.2)) represents the expected proportion of
wrong identifications for individuals of the ith population. So the
criterion of minimizing the probabilities of misclassification may serve
the purpose, and $g_j(x)$, the jth discriminant score defined by
(2.2.3), reduces to

$g_j(x) = -\sum_{i \ne j} q_i f_i(x) = \text{const} + q_j f_j(x)$,

where the term written as const, $-\sum_{i=1}^{k} q_i f_i(x)$, is the
same for every j; i.e., for this purpose, $g_j(x)$ may simply be defined
as $q_j f_j(x)$.
2.2.2  Minimax  Rule. 

In the preceding section, the formulation (ii) was considered with the
$q_i$'s, $1 \le i \le k$, known. It was seen that the optimal Bayes rule
$D^*$ depends upon the prior probabilities $q_1, q_2, \ldots, q_k$. In
most instances of the classification problem, the prior probabilities
$q_1, q_2, \ldots, q_k$ are not known to the statistician. Rao [1969]
has suggested the maximum likelihood method for estimating these
$q_i$'s, $1 \le i \le k$. Such a problem of unknown prior probabilities
arises, for example, in the differential diagnosis of diseases, where
the diseases may exhibit seasonal variations. It is not possible, in
such cases, to implement an optimal rule that minimizes the expected
loss. Instead, one minimizes the maximum risk. This criterion is the
so-called Minimax Criterion. The determination of such a rule, even if
it exists, may be difficult. But there exist situations where a decision
rule may be identified as a minimax rule. It has been proved that
minimax procedures are Bayes solutions with respect to a least
favourable a priori distribution, and the minimax risk equals the
so-called maximum Bayes risk. More generally, if there exists no such
prior distribution but only a sequence for which the Bayes risk tends to
the maximum, then the minimax procedures are limits of the associated
sequence of Bayes solutions (see Lehmann [1959] p. 17, or Rao [1973]
p. 496).


2.2.3  Linear  Discriminant  Function  Rule: 

The linear discriminant function rule (LDF rule), for classifying an
individual into one of two multivariate normal populations with the same
covariance matrix, was first introduced by Sir Ronald Fisher in 1936.
Fisher's idea was the basis for most of the research in multivariate
statistical classification theory. The method of finding discriminant
functions in arriving at test criteria for classification procedures has
been found extremely useful in multivariate analysis.

Suppose the populations have multivariate normal distributions with the
same covariance matrix $\Sigma$, but different mean vectors. The ith
density (i = 1,2) is given by

$f_i(x) = (2\pi)^{-p/2}\, |\Sigma|^{-1/2} \exp\{-\tfrac{1}{2}(x-\mu^{(i)})'\, \Sigma^{-1} (x-\mu^{(i)})\}$,

where $\mu^{(i)}$ (i = 1,2) denotes the mean vector of the ith
population. The ratio of the densities is

(2.2.7)    $\dfrac{f_1(x)}{f_2(x)} = \dfrac{\exp\{-\tfrac{1}{2}(x-\mu^{(1)})'\, \Sigma^{-1} (x-\mu^{(1)})\}}{\exp\{-\tfrac{1}{2}(x-\mu^{(2)})'\, \Sigma^{-1} (x-\mu^{(2)})\}}$

           $= \exp\{-\tfrac{1}{2}[(x-\mu^{(1)})'\, \Sigma^{-1} (x-\mu^{(1)}) - (x-\mu^{(2)})'\, \Sigma^{-1} (x-\mu^{(2)})]\}$.
Invoking the Bayes classification procedure for the case k = 2 (see
(2.2.6)), the region of classification into $\pi_1$, $D_1^*$, is the set
of X's for which the right hand side of (2.2.7) is greater than C. The
monotonicity of the logarithmic function yields (by rearrangement)

(2.2.8)    $D_1^* = \{X \in \mathcal{X} : U = X'\, \Sigma^{-1}(\mu^{(1)}-\mu^{(2)}) - \tfrac{1}{2}(\mu^{(1)}+\mu^{(2)})'\, \Sigma^{-1}(\mu^{(1)}-\mu^{(2)}) > \log C\}$.

The first term, $X'\, \Sigma^{-1}(\mu^{(1)}-\mu^{(2)})$, is the
well-known Fisher's linear discriminant function, a function linear in
the components of the observation vector X.
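The statistic U in (2.2.8) is easy to evaluate once the system $\Sigma a = \mu^{(1)} - \mu^{(2)}$ has been solved. The following is a minimal pure-Python sketch; the names `solve`, `ldf_statistic` and `ldf_assign` and the numerical values in the example are assumptions made for illustration, and sending the tie U = log C to population 2 is an arbitrary choice.

```python
def solve(a, b):
    """Solve the linear system a z = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [[float(v) for v in a[r]] + [float(b[r])] for r in range(n)]
    for c in range(n):
        pivot = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[pivot] = m[pivot], m[c]
        for r in range(c + 1, n):
            factor = m[r][c] / m[c][c]
            for cc in range(c, n + 1):
                m[r][cc] -= factor * m[c][cc]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][cc] * z[cc] for cc in range(r + 1, n))
        z[r] = (m[r][n] - tail) / m[r][r]
    return z

def ldf_statistic(x, mu1, mu2, sigma):
    """U = x' S^{-1}(mu1 - mu2) - (1/2)(mu1 + mu2)' S^{-1}(mu1 - mu2)."""
    coef = solve(sigma, [u - v for u, v in zip(mu1, mu2)])
    fisher_term = sum(xi * ci for xi, ci in zip(x, coef))
    offset = 0.5 * sum((u + v) * ci for u, v, ci in zip(mu1, mu2, coef))
    return fisher_term - offset

def ldf_assign(x, mu1, mu2, sigma, log_c=0.0):
    """Population 1 if U > log C, otherwise population 2 (ties sent to 2)."""
    return 1 if ldf_statistic(x, mu1, mu2, sigma) > log_c else 2

# Example with assumed parameter values.
mu1, mu2 = [1.0, 2.0], [0.0, 0.0]
sigma = [[2.0, 1.0], [1.0, 2.0]]
print(ldf_statistic(mu1, mu1, mu2, sigma), ldf_assign([0.9, 1.8], mu1, mu2, sigma))
```

Evaluating U at $x = \mu^{(1)}$ gives half of the quadratic form $(\mu^{(1)}-\mu^{(2)})'\Sigma^{-1}(\mu^{(1)}-\mu^{(2)})$, which provides a quick sanity check of an implementation.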


In the special case in which the two populations are equally likely,
and the losses due to misclassification are equal, C = 1 (see (2.2.6)),
and log C = 0. Then the region of classification into $\pi_1$, $D_1^*$,
is

$D_1^* = \{X \in \mathcal{X} : X'\, \Sigma^{-1}(\mu^{(1)}-\mu^{(2)}) > \tfrac{1}{2}(\mu^{(1)}+\mu^{(2)})'\, \Sigma^{-1}(\mu^{(1)}-\mu^{(2)})\}$.


If the a priori probabilities are unknown, we select log C = k, say
(here k denotes the cut-off value, not the number of populations), by
making the expected losses due to the wrong classifications equal. This
demands knowledge of the distribution of U. Anderson [1958] and
subsequently many authors studied the distribution of U. It is well
known that U is distributed as $N(\alpha/2, \alpha)$ when X is
distributed according to $N(\mu^{(1)}, \Sigma)$. When X is distributed
according to $N(\mu^{(2)}, \Sigma)$, U is distributed as
$N(-\alpha/2, \alpha)$, where

$\alpha = (\mu^{(1)}-\mu^{(2)})'\, \Sigma^{-1} (\mu^{(1)}-\mu^{(2)})$.
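The probabilities of misclassification are (see Anderson [1958])

$P(2|1) = \int_{-\infty}^{(k-\alpha/2)/\sqrt{\alpha}} \dfrac{1}{\sqrt{2\pi}}\, e^{-y^2/2}\, dy$

and

$P(1|2) = \int_{(k+\alpha/2)/\sqrt{\alpha}}^{\infty} \dfrac{1}{\sqrt{2\pi}}\, e^{-y^2/2}\, dy$.

Thus, for the minimax solution, we choose k so that

$C_{21} \int_{(k+\alpha/2)/\sqrt{\alpha}}^{\infty} \dfrac{1}{\sqrt{2\pi}}\, e^{-y^2/2}\, dy = C_{12} \int_{-\infty}^{(k-\alpha/2)/\sqrt{\alpha}} \dfrac{1}{\sqrt{2\pi}}\, e^{-y^2/2}\, dy$.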
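The two error probabilities are values of the standard normal c.d.f., so the minimax cut-off can be found numerically. The sketch below is an illustration only: the names `phi`, `error_probs` and `minimax_cutoff` are hypothetical, and the use of bisection (rather than any method from the literature cited above) is an assumption made here because the difference of the two weighted errors is monotone in k.

```python
import math

def phi(z):
    """Standard normal c.d.f., written with the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def error_probs(k, alpha):
    """Return (P(2|1), P(1|2)) for the rule 'assign to population 1 if U > k'."""
    s = math.sqrt(alpha)
    p21 = phi((k - alpha / 2.0) / s)        # from population 1, sent to 2
    p12 = 1.0 - phi((k + alpha / 2.0) / s)  # from population 2, sent to 1
    return p21, p12

def minimax_cutoff(alpha, c12=1.0, c21=1.0, tol=1e-12):
    """Solve C_21 * P(1|2) = C_12 * P(2|1) for k by bisection."""
    def h(k):
        p21, p12 = error_probs(k, alpha)
        return c12 * p21 - c21 * p12  # increasing in k
    lo, hi = -10.0 * (1.0 + alpha), 10.0 * (1.0 + alpha)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if h(mid) > 0.0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)

print(minimax_cutoff(4.0), error_probs(0.0, 4.0))
```

With equal losses the symmetry of the two integrals gives k = 0, which the bisection reproduces up to the tolerance; a larger $C_{21}$ pushes the cut-off upward.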


A special representation of the probability of correct classification
by the optimal LDF rule is given in section 2.2.6. Marshall and Olkin
[1968] derived the Bayes rule for normal populations in their special
set-up, pointed out earlier. Further, Anderson and Bahadur [1962]
considered the problem when the two multivariate normal populations have
unequal covariance matrices. The likelihood-ratio method can still be
used, but it does not lead to a linear discriminant function. The
discriminant score for the ith population (i = 1,2,...,k) is

$g_i(x) = -\tfrac{1}{2} \log |\Sigma_i| - \tfrac{1}{2}(x-\mu^{(i)})'\, \Sigma_i^{-1} (x-\mu^{(i)}) + \log q_i$,

which may be called a quadratic discriminant score. The decision rule
amounts to assigning an individual to that population for which his
quadratic discriminant score is the highest. Anderson and Bahadur [1962]
showed that no linear discriminant function can be an optimal rule. They
derived the minimax rule and characterized the minimal complete class.
After restricting to the class of rules based on linear functions of X,
they also established that among all the linear functions, Fisher's LDF
minimizes the probabilities of misclassification. Not much has been
studied on nonlinear discriminants subsequent to their paper.
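To see why unequal covariance matrices lead to a nonlinear rule, the quadratic discriminant score can be evaluated in the simplest case p = 1, where $|\Sigma_i|$ is just the variance. The sketch below assumes this univariate special case; the function names `quadratic_score` and `quadratic_assign` and the example values are hypothetical.

```python
import math

def quadratic_score(x, mean, var, prior):
    """Univariate (p = 1) quadratic score: -0.5*log(var) - 0.5*(x-mean)^2/var + log(prior)."""
    return -0.5 * math.log(var) - 0.5 * (x - mean) ** 2 / var + math.log(prior)

def quadratic_assign(x, means, variances, priors):
    """0-based index of the population with the highest quadratic score."""
    scores = [quadratic_score(x, m, v, q)
              for m, v, q in zip(means, variances, priors)]
    return scores.index(max(scores))

# Example with assumed values: equal means, unequal variances.
means, variances, priors = [0.0, 0.0], [1.0, 9.0], [0.5, 0.5]
print([quadratic_assign(x, means, variances, priors) for x in (-4.0, 0.0, 4.0)])
```

In this example the region assigned to the first population is an interval around the common mean, which no single linear cut-off on x could reproduce.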

Remark 2.3. (i) The choice of discriminant function in the preceding
discussions is not unique. We can always multiply a discriminant
function by a positive constant, or bias it by an additive constant,
without influencing the decision. Consequently, all the decision rules
so obtained are equivalent.

(ii) The extension of the above classification problem to
classification into one of several multivariate normal populations is
discussed in detail in Anderson [1958]. The underlying idea in his
approach is the same; namely, an ordered partition of the sample space
$\mathcal{X}$ such that the expected loss is a minimum. For a detailed
discussion of the topic, one is referred to Anderson [1958, pp. 147].

2.2.4 Minimum Distance Rule.

Consider the formulation (i). So far, in all the above classification
procedures it was assumed that the individual to be classified belongs
to one of the several specified populations. This assumption is
realistic in many taxonomic problems, such as sexing of skeletal
remains, where the possibilities of identification are limited to two.
However, when the external evidence is slight, the classification is
subject not only to error due to misclassification, but also to error
due to the possibly erroneous assumption that the individual belongs to
one of the specified populations. In order to have a better
justification of the classification, the best procedure would be to
first test whether or not the individual belongs to one of the given
populations. Unfortunately, no such test criterion is available.
Alternatively, we find which of the k populations is "nearest" or
"closest", measured in terms of some distance function, to the
individual to be classified.

An example in which the usual classification approach is not pertinent
is the following:

Suppose a relatively new language is to be compared with two or more
older languages. The purpose is to find which of these older languages
is most similar to the new one. If a measure of dissimilarity between
two languages is available in terms of a distance function, then the
question of which language is nearest to the new one is quite
appropriate.


This leads to the question of what measure of distance should be used.
For the case of multivariate normal populations, Mahalanobis [1936]
proposed the generalized squared distance as a measure of divergence
between the populations. The divergence is given by

(2.2.9)    $\Delta_p^2 = \sum_{i=1}^{p} \sum_{j=1}^{p} \sigma^{ij}\, \delta_i\, \delta_j$,

where $\delta_i$ denotes the difference in true mean values for the ith
character, $\sigma^{ij}$ denotes the (i,j)th element of the inverse of
the common or pooled covariance matrix, and p in the subscript denotes
the number of variables used.

Translating (2.2.9) into matrix notation, we have

(2.2.10)    $\Delta_p^2 = (\mu^{(1)}-\mu^{(2)})'\, \Sigma^{-1} (\mu^{(1)}-\mu^{(2)})$.

Mahalanobis' method is one of the earliest suggested distance methods,
having numerous applications in anthropometric studies. This method has
become a powerful tool in statistical and biometric research. But,
unfortunately, the formula (2.2.9) (or (2.2.10)) is not of much use in
practice, since the computation of the inverse matrix and of the
quadratic form in the differences of the mean values becomes extremely
laborious when the number of characters exceeds 4 or 5.


As the name suggests, the minimum distance rule classifies an
observation into that population which is at a minimum distance. In case
of ties, one can make a randomized decision. Consequently, the so-called
minimum distance rule classifies an observation X into $\pi_1$ or
$\pi_2$ (two multivariate normal populations with common covariance
matrix $\Sigma$) according as

(2.2.11)    $(X-\mu^{(1)})'\, \Sigma^{-1} (X-\mu^{(1)}) <$ or $> (X-\mu^{(2)})'\, \Sigma^{-1} (X-\mu^{(2)})$.
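A minimum distance rule based on the generalized squared distance can be sketched as follows. This is a hedged illustration rather than a procedure taken from the text: the names `solve`, `mahalanobis_sq` and `minimum_distance_assign` are hypothetical, the populations are indexed from 0, and ties are sent to the smallest index instead of being randomized.

```python
def solve(a, b):
    """Solve the linear system a z = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [[float(v) for v in a[r]] + [float(b[r])] for r in range(n)]
    for c in range(n):
        pivot = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[pivot] = m[pivot], m[c]
        for r in range(c + 1, n):
            factor = m[r][c] / m[c][c]
            for cc in range(c, n + 1):
                m[r][cc] -= factor * m[c][cc]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][cc] * z[cc] for cc in range(r + 1, n))
        z[r] = (m[r][n] - tail) / m[r][r]
    return z

def mahalanobis_sq(x, mu, sigma):
    """Generalized squared distance (x - mu)' sigma^{-1} (x - mu)."""
    d = [xi - mi for xi, mi in zip(x, mu)]
    return sum(di * zi for di, zi in zip(d, solve(sigma, d)))

def minimum_distance_assign(x, means, sigma):
    """0-based index of the nearest population mean (ties: smallest index)."""
    dist = [mahalanobis_sq(x, mu, sigma) for mu in means]
    return dist.index(min(dist))

# Example with assumed parameter values.
means = [[1.0, 2.0], [0.0, 0.0]]
sigma = [[2.0, 1.0], [1.0, 2.0]]
print(mahalanobis_sq(means[0], means[1], sigma))
print(minimum_distance_assign([0.1, -0.2], means, sigma))
```

Solving a linear system instead of forming the inverse matrix explicitly keeps the sketch short; for two populations with a common covariance matrix the resulting assignment coincides with the LDF rule with log C = 0.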
2.2.5 Probability of Correct Classification.

In the classification procedures discussed in the preceding sections,
the fundamental criterion for obtaining the optimal rule was to minimize
the expected loss or the probabilities of misclassification. Given a
rule D, the probability that it will correctly classify an individual
chosen randomly from the ith population is

$\int_{D_i} dF_i(x) = \int_{D_i} f_i(x)\, d\mu(x)$,  $1 \le i \le k$.

Consequently, the probability that a given rule D will correctly
classify an individual selected at random from the mixed population T is

(2.2.12)    r(D) = Probability of correct classification

            $= \sum_{i=1}^{k} q_i \int_{D_i} f_i(x)\, d\mu(x) = \sum_{i=1}^{k} \int_{D_i} g_i(x)\, d\mu(x)$

(see Remark 2.2(ii)).


The rule D was defined to be optimal if it minimized the probability of
misclassification. Equivalently, a rule D is optimal if it maximizes the
probability of correct classification over the domain $\mathcal{D}^*$ of
all classification rules. Let

(2.2.13)    $r^* = \sup_{D \in \mathcal{D}^*} r(D)$.

Then $r^*$ is called the optimal probability. From the above
definition, a classification procedure D is optimal if $r(D) = r^*$.
From (2.2.5), the optimal partition $D^*$ is defined by

$D_j^* = \{x \in \mathcal{X} : g_j(x) = \gamma(x)\}$,  $1 \le j \le k$.


Now,

(2.2.14)    $r^* = r(D^*) = \sum_{j=1}^{k} \int_{D_j^*} g_j(x)\, d\mu(x)$    (see (2.2.12))

            $= \sum_{j=1}^{k} \int_{D_j^*} \gamma(x)\, d\mu(x)$

            $= \int_{\mathcal{X}} \gamma(x)\, d\mu(x)$,

which is an expression for the optimal probability of correct
classification. For the case of two arbitrary distributions, we have

$\gamma(x) = \max\{g_1(x), g_2(x)\}$

$= \tfrac{1}{2}[g_1(x) + g_2(x)] + \tfrac{1}{2}|g_1(x) - g_2(x)|$

$= \tfrac{1}{2}\{q_1 f_1(x) + q_2 f_2(x)\} + \tfrac{1}{2}|q_1 f_1(x) - q_2 f_2(x)|$.

Thus, by (2.2.14),

$r^* = \int_{\mathcal{X}} \tfrac{1}{2}(q_1 f_1(x) + q_2 f_2(x))\, d\mu(x) + \tfrac{1}{2} \int_{\mathcal{X}} |q_1 f_1(x) - (1-q_1) f_2(x)|\, d\mu(x)$

$= \tfrac{1}{2} + \tfrac{1}{2} \int_{\mathcal{X}} |q_1 f_1(x) - (1-q_1) f_2(x)|\, d\mu(x)$.
In the case of two multivariate normal populations with mean vectors
$\mu^{(1)}$ and $\mu^{(2)}$ and common covariance matrix $\Sigma$, the
simple loss function and equal prior probabilities imply that

$r^* = \Phi(\Delta/2)$,

where $\Delta^2$ is the Mahalanobis generalized squared distance and
$\Phi$ is the c.d.f. of the standard normal variate.
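The two expressions for the optimal probability can be compared numerically in the univariate case. The sketch below integrates $|q_1 f_1 - (1-q_1) f_2|$ by a midpoint rule and compares the result with $\Phi(\Delta/2)$ for two unit-variance normal densities whose means differ by $\Delta$. The names `normal_pdf`, `phi` and `optimal_probability`, the integration limits and the step count are all assumptions chosen for illustration.

```python
import math

def normal_pdf(x, mean, sd):
    """Univariate normal density."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def phi(z):
    """Standard normal c.d.f."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def optimal_probability(q1, f1, f2, lo, hi, steps=20000):
    """r* = 1/2 + 1/2 * integral |q1 f1 - (1 - q1) f2|, by the midpoint rule."""
    h = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        x = lo + (i + 0.5) * h
        total += abs(q1 * f1(x) - (1.0 - q1) * f2(x))
    return 0.5 + 0.5 * total * h

delta = 2.0  # assumed distance between the two means, in standard deviations
r_star = optimal_probability(
    0.5,
    lambda x: normal_pdf(x, 0.0, 1.0),
    lambda x: normal_pdf(x, delta, 1.0),
    -10.0, 12.0,
)
print(r_star, phi(delta / 2.0))
```

The two printed values should agree to several decimal places; the integral form remains usable for unequal priors or non-normal densities, where no closed form such as $\Phi(\Delta/2)$ is available.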

§2.3 Sample-Based Classification Rules.

In section 2.2, the classification procedures all had an underlying
assumption that the densities have specified parametric forms, with all
parameters known. In most cases, however, the population parameters are
not known, but must be estimated from samples. On the basis of the
information available from the samples, we wish to classify an
individual into one of a finite number of populations. It was noted in
Section 2.2.5 that the optimal rule, $D^*$, and r(D), the probability of
correct classification for an arbitrary rule $D \in \mathcal{D}^*$,
could not be determined unless the distributions $F_i$ (i = 1,2,...,k)
and the prior probabilities $q_i$ (i = 1,2,...,k) were specified. Two
questions then arise:

(i) Not knowing an optimal rule, how do we construct a rule from the
sample data?

(ii) Given a rule D from the sample data, when are the actual
probability r(D) and the optimum probability $r^*$ approximately equal?
These questions are answered in the following sections.

2.3.1 Plug-in Rules.

Suppose that the dominating measure $\mu$ is specified, but the
$q_i f_i$ ($1 \le i \le k$) are not specified. Suppose further that our
inference is based on a well identified random sample of size n drawn
from the mixed population T, and $n_1, n_2, \ldots, n_k$ are the numbers
of sampled individuals from $\pi_1, \pi_2, \ldots, \pi_k$ respectively.
Thus, each of the $n_i$ is a binomial variable with expectation $n q_i$
(i = 1,2,...,k). Since the densities $f_i$, $1 \le i \le k$, involve
unknown parameters, the main problem in obtaining "plug-in" rules is to
get reasonable estimates of these unknown parameters. Generally,
maximum-likelihood or consistent estimates are used. The corresponding
estimates are substituted in place of the unknown parameters to give an
estimate of the densities $f_i$, $1 \le i \le k$. Ghurye and Olkin
[1969] give parametric multivariate normal density estimates that are
pointwise unbiased.

If we have estimates $\widehat{q_i f_i}$, $1 \le i \le k$, then
evidently an intuitive choice of rule is that rule $\hat{D}$ obtained by
substituting $\widehat{q_i f_i}$ for $q_i f_i$ in the expression (2.2.5)
for the optimal rule $D^*$. Similarly, we can substitute the estimates
into the expression (2.2.12) for r(D). We call $\hat{D}$ the "plug-in"
rule. In most instances, we use the estimates

$\widehat{q_i f_i}(x) = \hat{q}_i \hat{f}_i(x)$,  $1 \le i \le k$,

where

(2.3.1)    $\hat{q}_i = n_i / n$,  $1 \le i \le k$,

and $\hat{f}_i$ is some estimate of the density $f_i$
($1 \le i \le k$) obtained by substituting the estimates for the unknown
parameters.


The estimates $\hat{q}_i$ given by (2.3.1) are quite well behaved. They
satisfy

(2.3.2)    $\hat{q}_i \xrightarrow{\text{a.s.}} q_i$  as  $n \to \infty$

by the strong law of large numbers. If the $q_i$'s are known, then one
uses $\widehat{q_i f_i}(x) = q_i \hat{f}_i(x)$, $1 \le i \le k$. One
also obtains immediately the estimates

$\hat{g}_j(x) = \hat{q}_j \hat{f}_j(x)$,  $1 \le j \le k$,

$\hat{\gamma}(x) = \max_{1 \le j \le k} \hat{g}_j(x)$

of $g_j(x)$ and $\gamma(x)$ respectively.
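The plug-in construction can be made concrete with a small sketch. It assumes, purely for illustration, univariate normal densities with maximum-likelihood estimates of mean and variance, 0-based group labels, and at least two distinct observations per group; the names `fit_plugin`, `plugin_assign` and `normal_pdf` and the sample values are hypothetical.

```python
import math
from collections import defaultdict

def normal_pdf(x, mean, var):
    """Univariate normal density parameterised by its variance."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def fit_plugin(sample, k):
    """Estimate q_i by n_i / n and (mean, variance) by maximum likelihood.

    sample is a list of (label, x) pairs drawn from the mixed population,
    with labels 0, ..., k-1.
    """
    groups = defaultdict(list)
    for label, x in sample:
        groups[label].append(x)
    n = len(sample)
    q_hat, params = [], []
    for i in range(k):
        xs = groups[i]
        mean = sum(xs) / len(xs)
        var = sum((x - mean) ** 2 for x in xs) / len(xs)  # ML estimate
        q_hat.append(len(xs) / n)
        params.append((mean, var))
    return q_hat, params

def plugin_assign(x, q_hat, params):
    """Assign x to the group with the largest estimated score q_hat_j * f_hat_j(x)."""
    g_hat = [q * normal_pdf(x, m, v) for q, (m, v) in zip(q_hat, params)]
    return g_hat.index(max(g_hat))

# Example with made-up observations.
sample = [(0, -1.0), (0, 0.0), (0, 1.0), (1, 4.0), (1, 5.0), (1, 6.0), (1, 5.0)]
q_hat, params = fit_plugin(sample, 2)
print(q_hat, plugin_assign(0.2, q_hat, params), plugin_assign(4.8, q_hat, params))
```

Any other consistent density estimate could be substituted for `normal_pdf` without changing the structure of the rule.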


Throughout the classification literature, the plug-in rules seem to be
the only rules ever considered when specifications are incomplete. The
general theory has not yet been studied satisfactorily. All one can do
is to substitute the estimates of the unknown parameters. In the case of
plug-in rules, the optimality criterion can no longer be justified
except for large samples, for which the performance of the plug-in rule
$\hat{D}$ is, in some sense, close to that of the optimal rule $D^*$. In
fact, due to sampling variations in the estimation of the parameters,
the plug-in rule $\hat{D}$ is no longer the best. The only justification
Anderson [1958] gives for the use of the plug-in linear discriminant is
that "it seems intuitively reasonable that this rule should give good
results". The following are some special cases:


(i)  Anderson's Rule:

Suppose we have samples $x_1^{(1)},\ldots,x_{n_1}^{(1)}$ and $x_1^{(2)},\ldots,x_{n_2}^{(2)}$
from two multivariate normal populations $\pi_1$ and $\pi_2$ respectively,
with all parameters $\mu^{(1)}$, $\mu^{(2)}$ and the common covariance
matrix, $\Sigma$, unknown. In the case of known parameters, the optimal rule
$D^*$ was defined by

        $D_1^* = \{x \in X : x'\Sigma^{-1}(\mu^{(1)}-\mu^{(2)}) - \tfrac{1}{2}(\mu^{(1)}+\mu^{(2)})'\Sigma^{-1}(\mu^{(1)}-\mu^{(2)}) \ge \log c\}$

        $D_2^* = X - D_1^*$ .

Since, in this case, the parameters are unspecified, the usual plug-in
linear discriminant is that rule $\hat{D}$ obtained by substituting the best
(namely unbiased) estimates of these unknown parameters. Consequently,
the plug-in rule, $\hat{D}$, is given by

(2.3.3)     $\hat{D}_1 = \{x \in X : x'S^{-1}(\bar{x}^{(1)}-\bar{x}^{(2)}) - \tfrac{1}{2}(\bar{x}^{(1)}+\bar{x}^{(2)})'S^{-1}(\bar{x}^{(1)}-\bar{x}^{(2)}) \ge \log c\}$


The term $x'S^{-1}(\bar{x}^{(1)}-\bar{x}^{(2)})$ is the linear discriminant based
on two samples and is called Anderson's plug-in linear discriminant. The
classification statistic is denoted by $V(x)$; i.e.,

(2.3.4)     $V(x) = [x - \tfrac{1}{2}(\bar{x}^{(1)}+\bar{x}^{(2)})]'\,S^{-1}(\bar{x}^{(1)}-\bar{x}^{(2)})$


Anderson [1958] has obtained the asymptotic distribution of
$V$. Its exact distribution is not known explicitly. He has shown that
the distribution of $V$ approaches the distribution of $U$ (see (2.2.8)) as
the sample sizes increase indefinitely. Hence, for sufficiently large
samples from $\pi_1$ and $\pi_2$ we can proceed as if the parameters were
completely specified.

(For an example of this rule of classification, see Appendix I.)
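
As an illustration of the plug-in statistic in (2.3.3) and (2.3.4), the following is a minimal Python sketch, restricted to $p = 2$ so that the matrix inverse can be written out by hand. The function names (`mean_vec`, `pooled_cov`, `inv2`, `anderson_v`, `classify`) and the toy data in the usage note are hypothetical and are not taken from the text; the pooled covariance uses the divisor $n_1 + n_2 - 2$, which is assumed here to be the "best (namely unbiased)" estimate referred to above.

```python
from math import log

def mean_vec(rows):
    """Sample mean vector of a list of observations."""
    n = len(rows)
    return [sum(r[j] for r in rows) / n for j in range(len(rows[0]))]

def pooled_cov(s1, s2):
    """Pooled covariance S with divisor n1 + n2 - 2 (assumed unbiased estimate)."""
    m1, m2 = mean_vec(s1), mean_vec(s2)
    p = len(m1)
    S = [[0.0] * p for _ in range(p)]
    for rows, m in ((s1, m1), (s2, m2)):
        for r in rows:
            for a in range(p):
                for b in range(p):
                    S[a][b] += (r[a] - m[a]) * (r[b] - m[b])
    dof = len(s1) + len(s2) - 2
    return [[v / dof for v in row] for row in S]

def inv2(S):
    """Inverse of a 2x2 matrix; this sketch only handles p = 2."""
    det = S[0][0] * S[1][1] - S[0][1] * S[1][0]
    return [[S[1][1] / det, -S[0][1] / det],
            [-S[1][0] / det, S[0][0] / det]]

def anderson_v(x, s1, s2):
    """V(x) = [x - (xbar1 + xbar2)/2]' S^{-1} (xbar1 - xbar2)."""
    m1, m2 = mean_vec(s1), mean_vec(s2)
    s_inv = inv2(pooled_cov(s1, s2))
    diff = [m1[j] - m2[j] for j in range(2)]
    w = [sum(s_inv[a][b] * diff[b] for b in range(2)) for a in range(2)]
    centred = [x[j] - 0.5 * (m1[j] + m2[j]) for j in range(2)]
    return sum(centred[j] * w[j] for j in range(2))

def classify(x, s1, s2, c=1.0):
    """Assign x to population 1 when V(x) >= log c, otherwise to population 2."""
    return 1 if anderson_v(x, s1, s2) >= log(c) else 2
```

With two hypothetical samples `s1 = [(0, 0), (1, 0), (0, 1), (1, 1)]` and `s2 = [(3, 3), (4, 3), (3, 4), (4, 4)]`, a point at the first sample mean is assigned to population 1 and a point at the second sample mean to population 2, while the midpoint of the two means gives $V = 0$.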


(ii)  Mahalanobis' Studentized $D_p^2$:

The plug-in version of Mahalanobis' generalized squared
distance, $\Delta_p^2$, is his studentized $D_p^2$, obtained by replacing the
unknown parameters $\mu^{(1)}$, $\mu^{(2)}$, and $\Sigma$ by their corresponding 'best'
estimates. Let there be two samples of sizes $n_1$ and $n_2$ from $\pi_1$
and $\pi_2$ respectively. $D_p^2$ is given by

(2.3.5)     $D_p^2 = \sum_i \sum_j s^{ij}\, d_i\, d_j$

where $d_i$ denotes the difference in the mean values for the $i$th
variable in the two samples; $(s^{ij})$ denotes the elements of the inverse
matrix of the estimate of the common or pooled covariance matrix.

Putting (2.3.5) in matrix notation, we get

(2.3.6)     $D_p^2 = (\bar{x}^{(1)}-\bar{x}^{(2)})'\,S^{-1}(\bar{x}^{(1)}-\bar{x}^{(2)})$

Consequently, mimicking what was done in section 2.2.4, the
plug-in minimum distance rule classifies an observation $X_0$ into $\pi_1$
or $\pi_2$ according as

(2.3.7)     $(X_0-\bar{x}^{(1)})'\,S^{-1}(X_0-\bar{x}^{(1)}) \;\lessgtr\; (X_0-\bar{x}^{(2)})'\,S^{-1}(X_0-\bar{x}^{(2)})$


An increase in $D_p^2$ due to the additional information
supplied by new variables may not be appreciable. A higher value of the ratio

        $R = \dfrac{1 + \dfrac{n_1 n_2}{(n_1+n_2)(n_1+n_2-2)}\, D_{p+q}^2}{1 + \dfrac{n_1 n_2}{(n_1+n_2)(n_1+n_2-2)}\, D_p^2}$

would indicate that the $q$ new variables supply some information (see Rao
[1952]).

(For an example of this result, see Appendix I.)
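
The ratio $R$ is simple to evaluate once the two studentized distances are available. The short sketch below only computes $R$ as written above; the function name `rao_ratio` and the numbers in the usage note are hypothetical, and the sketch says nothing about how large $R$ must be before the additional variables are judged informative (for that, see Rao [1952]).

```python
def rao_ratio(d2_p, d2_pq, n1, n2):
    """Ratio R comparing D^2 on p variables (d2_p) with D^2 on p + q variables (d2_pq)."""
    c = n1 * n2 / ((n1 + n2) * (n1 + n2 - 2))
    return (1 + c * d2_pq) / (1 + c * d2_p)
```

For instance, with hypothetical values $n_1 = n_2 = 10$, $D_p^2 = 4$ and $D_{p+q}^2 = 6$, the ratio exceeds 1, whereas $D_{p+q}^2 = D_p^2$ gives exactly $R = 1$.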


Result 2.4:  $D_p^2$, the Mahalanobis' studentized distance, is not an
unbiased estimate of $\Delta_p^2$, the Mahalanobis' generalized squared
distance.


2.3.2  Likelihood-Ratio  Criterion. 

Another criterion that could be considered in classification
theory is the likelihood-ratio criterion, first introduced by
Anderson [1951]. Let the class densities be known except for some
parameters. For example, the populations may have multivariate normal
densities with common unknown covariance matrix and unknown mean vectors.


Let $n$ be the size of the "training" sample and $n_i$ be the
size of the sample from $\pi_i$ $(i = 1,2,\ldots,k)$. Let $n_0$ be the size of
the sample from $\pi_0$, which is to be classified. We shall denote such
a sample by "CS". Let $L(TS)$ denote the likelihood of the "training"
sample and $L_i(CS)$ denote the likelihood of CS under the hypothesis
$\pi_0 = \pi_i$, $i = 1,2,\ldots,k$.


Let

        $\lambda_i = \sup \left\{ \dfrac{L_i(CS)}{L(TS)} \right\}$ ,

the supremum being taken over the parametric space.

A likelihood-ratio rule (LR rule) classifies CS into $\pi_i$
iff

        $k_i \lambda_i = \max_{1 \le j \le k} (k_j \lambda_j)$

where the $k_i$'s are non-negative constants. Ties may be resolved in some
manner.

A maximum-likelihood rule (ML rule) is an LR rule with equal
$k_i$'s. Equivalently, an ML rule classifies an observation $X_0$ into
$\pi_i$ if the maximized likelihood obtained under the assumption that $X_0$
comes from $\pi_i$ is greater than the corresponding maximized likelihood
assuming that the observation $X_0$ comes from $\pi_j$, $j \ne i$.


As a particular case, consider the classification of an
observation, $X_0$, into one of two multivariate normal populations, $\pi_1$ and
$\pi_2$, with all parameters unknown. Let $x_1^{(1)},\ldots,x_{n_1}^{(1)}$ and
$x_1^{(2)},\ldots,x_{n_2}^{(2)}$ be the samples of sizes $n_1$ and $n_2$ from $\pi_1$
and $\pi_2$ respectively. Considering the maximum likelihood estimates of
the parameters under the two hypotheses that $X_0$ comes from $\pi_1$, and
$X_0$ comes from $\pi_2$, the ML rule classifies an observation, $X_0$, into
$\pi_1$ or $\pi_2$ according as

(2.3.8)     $(1+n_1^{-1})^{-1}(X_0-\bar{x}^{(1)})'\,S^{-1}(X_0-\bar{x}^{(1)}) \;\lessgtr\; (1+n_2^{-1})^{-1}(X_0-\bar{x}^{(2)})'\,S^{-1}(X_0-\bar{x}^{(2)}) + \lambda$


If $\Sigma$ is known, then $S$ is replaced by $\Sigma$ in the above expression.
(For details see Anderson [1958], p. 141.)


Das Gupta [1965] considers the above ML rule and has
established that it is an unbiased, admissible minimax rule. Further,
if the loss function $\ell$ is continuous and such that

(2.3.9)     $\lim_{y \to 0} \ell(y) = 0$

then the ML rule is the unique minimax rule. In the case of unknown $\Sigma$, Das
Gupta proves that the ML rule is unbiased, admissible and minimax in an
invariant class, and that if the loss function $\ell$ is continuous and satisfies
(2.3.9), then it is the unique minimax rule in the invariant class.


Remark 2.5:  In the case of classification into one of two multivariate
normal populations $\pi_1$ and $\pi_2$ with parameters unspecified, the MD
rule, the ML rule, and Anderson's rule are special cases of the following
rule:

Classify an observation, $X_0$, into $\pi_1$ or $\pi_2$ according as

(2.3.10)    $a\,(X_0-\bar{x}^{(1)})'\,S^{-1}(X_0-\bar{x}^{(1)}) \;\lessgtr\; (X_0-\bar{x}^{(2)})'\,S^{-1}(X_0-\bar{x}^{(2)}) + b$

For example:

(i)  $a = (1+n_1^{-1})^{-1}(1+n_2^{-1})$ and $b = (1+n_2^{-1})\lambda$, with $\lambda$ as in (2.3.8), gives the ML
rule.

(ii)  $a = 1$ and $b = 0$ gives the MD rule.

(iii)  $a = 1$ and $b = -2 \log c$ gives Anderson's rule.
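
The correspondence in case (iii) can be checked directly. The following short derivation is a sketch in which the abbreviations $Q_1$, $Q_2$ are introduced here for convenience and $V$ denotes the left-hand side of the inequality in (2.3.3); it uses only the symmetry of $S^{-1}$.

```latex
\begin{align*}
Q_j &= (X_0-\bar{x}^{(j)})'\,S^{-1}(X_0-\bar{x}^{(j)}), \qquad j = 1,2, \\
Q_2 - Q_1 &= 2\,X_0'\,S^{-1}(\bar{x}^{(1)}-\bar{x}^{(2)})
            - (\bar{x}^{(1)}+\bar{x}^{(2)})'\,S^{-1}(\bar{x}^{(1)}-\bar{x}^{(2)})
            \;=\; 2\,V(X_0), \\
V(X_0) \ge \log c
  &\iff Q_1 \le Q_2 - 2\log c .
\end{align*}
```

Thus Anderson's rule is (2.3.10) with $a = 1$ and $b = -2\log c$, and taking $c = 1$ recovers the MD rule of case (ii).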

Remark 2.6:  We have so far considered procedures for classifying an
individual into one of many populations, specified completely or not,
with quantitative data. Sometimes, however, the data are qualitative or
"categoric" (known only by category). In that case, the variables
have discrete distributions. The most familiar instance is the process
of medical diagnosis using laboratory tests with discrete outcome states,
-/+ ; -/?/+ ; or milky/greenish/clear/dark etc. (for a liquid). Glick
[1973] considers this problem at length and arrives at sample-based
multinomial classification rules. He has also obtained some results on the
asymptotic optimality of these rules. For a detailed discussion of this
topic, the reader is referred to Glick [1969, 1973].


2.3.3.  On the Estimation of the Probability of Correct Classification.


There are at least two reasons for wanting to know the
probability of correct classification of a classification procedure. One is
to see if the classification rule performs well enough to be useful.
Another is to compare its performance with that of a competing rule. In sections
2.2.2 and 2.2.5, we obtained an expression for the optimal rule, $D^*$,
and for $r^*$, the optimal probability of correct classification,
respectively, when the distributions were completely specified. In the case
of unspecified parameters, section 2.3.1 discusses the choice of plug-in
rules, $\hat{D}$, obtained by using suitable estimates of the unknown
parameters in the expression for $D^*$. Corresponding to this $\hat{D}$, $r(\hat{D})$
denotes the probability of correct classification.


The density plug-in estimator, $\hat{r}$, of the optimum probability
$r^* = r(D^*)$, is defined by

(2.3.11)    $\hat{r} = \hat{r}(D^*) = \sum_{i=1}^{k} \int_{\hat{D}_i} \hat{q}_i \hat{f}_i \, d\mu$

where

        $\hat{D}_i = \{x \in X : \hat{g}_i(x) = \hat{\gamma}(x)\}$ ,  $1 \le i \le k$

are the components of the partition $\hat{D}$. The probability of correct
classification for the plug-in rule $\hat{D}$ has the expression

(2.3.12)    $r(\hat{D}) = \sum_{i=1}^{k} \int_{\hat{D}_i} q_i f_i(x) \, d\mu(x)$

(This is analogous to (2.2.12), for $r(D)$.)

Mimicking what we did to arrive at (2.2.13) we get

        $\hat{r} = \int \hat{\gamma}(x) \, d\mu(x)$

In (2.3.12), if we substitute the estimates of $q_i f_i(x)$, we
get $\hat{r}$ as an estimate of $r(\hat{D})$ as well. Thus,

(2.3.13)    $\hat{r} = \hat{r}(\hat{D}) = \int \hat{\gamma}(x) \, d\mu(x)$


Thus, the plug-in approach yields the same estimator, $\hat{r}$,
as an estimate of both the optimal probability, $r^*$, and the actual
probability of correct classification for $\hat{D}$, $r(\hat{D})$. Glick [1972] has
shown that if the estimates $\hat{q}_i \hat{f}_i$ are pointwise unbiased, or more
generally satisfy

        $E(\hat{q}_i \hat{f}_i(x)) \ge q_i f_i(x)$ ,  $1 \le i \le k$ ,  almost all $x \in X$

then $\hat{r}$ is biased as an estimate of either the optimal probability or
the actual probability of correct classification, $r(\hat{D})$. (For proof
see Theorem 4.1.) Glick also states general conditions under which $\hat{r}$
is a consistent estimate of $r^*$. (Theorems 4.2, 4.3 and 4.4; for proofs
see section 4.1 of Chapter IV.)

Lachenbruch and Mickey [1968] have suggested a number of
methods for estimating the two components of the probability of
misclassification, namely

        $P_1 = P\{V(X) < 0 \mid X \in \pi_1\}$

and

        $P_2 = P\{V(X) > 0 \mid X \in \pi_2\}$

where $V(X)$ is Anderson's statistic given in (2.3.4). The techniques
may be divided into two classes: those using a sample to evaluate a
given discriminant function and those using the properties of the normal
distribution. Techniques of the second class depend heavily on normality for
their validity. For the multivariate case, Lachenbruch and Mickey [1968]
comment that their "method D" tends to be "badly biased and give much
too favourable an impression of the probability of error". They carried out
a comparative evaluation of all their suggested methods of estimating
$P_1$ and $P_2$ on the basis of a series of Monte Carlo experiments.
They concluded that no one method is uniformly best for every situation,
although the D and R methods appear to be relatively poor and the O
method does fairly well. (For a discussion of these methods, see
Lachenbruch and Mickey [1968] or Kshirsagar [1972].)

Remark 2.7:  In the case of a sample-based classification procedure
classifying an individual into one of two multivariate normal populations, the
probability of correct classification is not $\Phi(\Delta_p / 2)$, nor can it be
obtained in a similar manner.

2.3.4  Step-Down  Procedure. 

In most classification procedures, it would be desirable to
find the magnitude of the errors committed. Consequently, much
attention is devoted to obtaining the exact and asymptotic
distributions of the classification statistics. In most cases, the usual
asymptotic expression for error is an underestimate of the actual error
(see Srivastava [1973]). Srivastava [1973] proposes the "step-down"
procedure for use when the variates can be arranged according to their importance
on a priori grounds.

Let the two populations have multivariate normal densities
with the same covariance matrix, $\Sigma$. The classification is carried out
on the basis of the marginal univariate distribution of the first
variate, on the conditional univariate distribution of the second variate
given the first, on the conditional univariate distribution of the
third variate given the first and the second, and so on. Let
$X = [x_1,\ldots,x_p]$ be the vector to be classified, and let
$X_{(i)} = [x_1, x_2, \ldots, x_i]$. We define $Y_{(i)}$, $Z_{(i)}$, $\mu^{(j)}_{(i)}$ $(j = 1,2)$
similarly for the two populations. Let the top left-hand $i \times i$
submatrix of $\Sigma = [(\sigma_{ij})]$ be denoted by $\Sigma_i$. Let

        $\beta_i = \Sigma_i^{-1}\,[\sigma_{1,i+1}, \sigma_{2,i+1}, \ldots, \sigma_{i,i+1}]'$ ,  $i = 1,2,\ldots,p-1$

and

        $\sigma_{i+1}^2 = \dfrac{|\Sigma_{i+1}|}{|\Sigma_i|}$ ,  $i = 0,1,2,\ldots,p-1$

with the convention that $\beta_0 = 0$ and $|\Sigma_0| = 1$, so that
$\sigma_1^2 = \sigma_{11}$. We call $\beta_i$ the $i$th order step-down regression
coefficient and $\sigma_{i+1}^2$ the $i$th order step-down residual variance. Let

        $\eta^{(j)}_{i+1} = \mu^{(j)}_{i+1} - \mu^{(j)}_{(i)}\,\beta_i$ ,  $i = 0,1,2,\ldots,p-1$ ;  $j = 1,2$ .

Then under the condition that $Y_{(i)}$ is fixed, the conditional distribution
of $y_{i+1}$ is normal with mean $\eta^{(1)}_{i+1} + Y_{(i)}\beta_i$ and variance $\sigma_{i+1}^2$.
The distributions of $z_{i+1}$ given $Z_{(i)}$ and $x_{i+1}$ given $X_{(i)}$ are
similar.

Let $\hat{\beta}_i$ be the usual estimator of $\beta_i$ (obtained by replacing the
unknown parameters by the 'best' sample estimates). Let, for $i = 0,1,2,\ldots,p-1$,

(2.3.14)    $\tilde{x}_{i+1} = x_{i+1} - X_{(i)}\hat{\beta}_i$
            $\tilde{y}_{i+1} = y_{i+1} - Y_{(i)}\hat{\beta}_i$
            $\tilde{z}_{i+1} = z_{i+1} - Z_{(i)}\hat{\beta}_i$

Then the step-down procedure classifies an individual with measurements
$X$ into $\pi_1$ if for all $i = 1,2,\ldots,p$

(2.3.15)    $\tilde{x}_i(\tilde{y}_i - \tilde{z}_i) - \tfrac{1}{2}(\tilde{y}_i + \tilde{z}_i)(\tilde{y}_i - \tilde{z}_i) > 0$ ,

and into $\pi_2$ if for all $i = 1,2,\ldots,p$ the left-hand side of (2.3.15)
is $< 0$; otherwise it is assigned to neither $\pi_1$ nor $\pi_2$. (For an
expression for the probability of misclassification for this procedure,
see Srivastava [1973].)


Remark 2.8:  In the step-down procedure, an individual may not be
classified at all into either of the two populations $\pi_1$, $\pi_2$. This is one
of the features of this procedure, for it is better not to assign to
either of the two in the absence of sufficient evidence. The procedure is
clearly not invariant under permutation of the variates, and should be
used only when the variates can be arranged on a priori grounds.


CHAPTER  III 


Nonparametric  Classification 

In Chapter II, we discussed some major parametric rules of
classification and the associated probabilities of misclassification.
These techniques assume the existence and knowledge of the underlying
probability densities. In practice, however, the forms of the underlying
distributions are seldom known and one is often confronted with the
problem of devising appropriate classification rules, applicable to a
wider class of distributions whose structures are not expressible in
simple parametric forms. In such cases, the use of parametric procedures
is subject to criticism regarding its appropriateness and validity. For
such situations, one uses the so-called "nonparametric" or
"distribution-free" methods, which are the subject of this chapter.

§3.1  Statement  of  the  Problem. 

The problem is to classify units into a specified number of
populations on the basis of a set of observations on these units, with
all population distributions $F_i$ $(i = 1,2,\ldots,k)$ unspecified. Some
assumptions, however, are needed for constructing discriminant rules, for
example, the existence of densities, a unique mode, etc. In the case of
nonparametric classification procedures, the main emphasis is:

(i)  to study the asymptotic behaviour of the rules (e.g.,
consistency, efficiency),

(ii)  to obtain suitable bounds for the probability of correct
classification.
§3.2  On the Estimation of the Probability Density Function.

A basic and important problem in nonparametric classification
techniques is the estimation of the assumed probability density
function and its mode. Discriminant criteria are then based on the
estimates of these assumed densities. There are two forms of density
estimation: parametric and nonparametric.

3.2.1  Nonparametric  Density  Estimation. 

If the functional form of the density is known but depends
upon a finite number of unknown parameters, the usual method of
estimation would be to obtain suitable estimates of these unknown parameters
and plug in these estimates in place of the unknown parameters, giving an
estimate of the parametrized density. This case was discussed in
Chapter II, to obtain the so-called "plug-in" rules of classification.

. . . ,
finite measure $\mu$. Fix and Hodges [1951] were the first to consider
nonparametric density estimation in connection with nonparametric
discrimination. Parzen [1962] and later Cacoullos [1966], who generalized
Parzen's work to the multivariate case, developed a class of
nonparametric density estimates having the form

(3.2.1)     $\hat{f}_i(x) = \dfrac{1}{h} \int k\!\left(\dfrac{x-y}{h}\right) d\hat{F}_i(y) = \dfrac{1}{n_i h} \sum_{j=1}^{n_i} k\!\left(\dfrac{x - X_{ij}}{h}\right)$ ,  $1 \le i \le k$
where $X_{ij}$ is the $j$th sample observation from $\pi_i$, $\hat{F}_i$ is the
empirical distribution function of the $n_i$ individuals sampled from $\pi_i$
$(i = 1,2,\ldots,k)$, $k(x)$ is a bounded Lebesgue integrable function
on $(-\infty, \infty)$ such that

        $\lim_{x \to \infty} |x\,k(x)| = 0$

(3.2.2)     $\int_{-\infty}^{\infty} k(x)\,dx = 1$

and $h = h(n)$, where $n = \sum_{i=1}^{k} n_i$, is a non-negative sequence satisfying

(3.2.3)     $\lim_{n \to \infty} h(n) = 0$

Functions $k(x)$ of the above type satisfying (3.2.2) are
called 'weighting' or 'kernel' functions. It should be noted that the
choice of $k(x)$ is very important, and to a large extent determines
the properties of $\hat{f}_i(x)$. One simple example of a kernel function is

        $k(x) = \begin{cases} \tfrac{1}{2} , & |x| \le 1 \\ 0 , & |x| > 1 \end{cases}$

For different choices of kernel functions, see Table 1 of Parzen [1962].


This definition includes the special cases of the form

        $\hat{f}_i(x) = \dfrac{\hat{F}_i(x+h) - \hat{F}_i(x-h)}{2h}$ ,  $1 \le i \le k$

where $h \to 0$ as $n \to \infty$. The estimates suggested by Parzen and Cacoullos
are also called "fixed window" estimates. If, in addition to (3.2.3),
$h = h(n)$ satisfies

(3.2.4)     $\lim_{n \to \infty} n\,h(n) = \infty$ ,

then Parzen [1962] proved that these density estimates are consistent.
He also proved the asymptotic normality of the estimates. (For details
see Parzen [1962].) Using Parzen's density estimates (3.2.1), and
added conditions, Glick [1969, Theorem 6d, p. 72] proves the consistency
of the plug-in estimator $\hat{r}$ of the optimum probability $r^*$.
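
To make the fixed window estimate concrete, here is a minimal univariate Python sketch of the sum form of (3.2.1) with the simple rectangular kernel given above. The function names (`uniform_kernel`, `parzen_estimate`, `window_estimate`) and the sample in the usage note are hypothetical; the window width `h` is passed in directly rather than derived from a sequence $h(n)$.

```python
def uniform_kernel(u):
    """Rectangular kernel: 1/2 on |u| <= 1 and 0 elsewhere."""
    return 0.5 if abs(u) <= 1 else 0.0

def parzen_estimate(x, sample, h, kernel=uniform_kernel):
    """Fixed window estimate (1 / (n h)) * sum_j k((x - X_j) / h)."""
    n = len(sample)
    return sum(kernel((x - xj) / h) for xj in sample) / (n * h)

def window_estimate(x, sample, h):
    """Special case [F_n(x + h) - F_n(x - h)] / (2 h) via the empirical d.f.

    Agrees with parzen_estimate for the rectangular kernel except when a
    sample point falls exactly at x - h, because F_n is right-continuous.
    """
    n = len(sample)
    upper = sum(1 for xj in sample if xj <= x + h) / n
    lower = sum(1 for xj in sample if xj <= x - h) / n
    return (upper - lower) / (2 * h)
```

For the hypothetical sample `[0.0, 0.1, 0.2, 1.0, 2.0]` with `h = 0.5`, three of the five points lie within distance `h` of `x = 0.1`, so both functions give 3 / (5 × 2 × 0.5) = 0.6.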

Suppose $D^*$ is a Bayes rule with respect to a prior
distribution, assuming the densities in the $k$ populations are known. Let $\hat{D}$
be the plug-in rule. By Remark 2.2(i), $\rho(D^*)$ denotes the Bayes risk
of the optimal rule $D^*$. Let $R(\hat{D})$ denote the Bayes risk of the
plug-in rule $\hat{D}$. Van Ryzin [1966] introduces the notion of "Bayes risk
consistency", defined as follows.

Definition 3.1.  The rule $\hat{D}$ is Bayes risk consistent (BRC) with $D^*$
if, for every $\varepsilon > 0$,

        $P[R(\hat{D}) - \rho(D^*) > \varepsilon] \to 0$

as the sample sizes in the training sample tend to $\infty$.

With respect to this notion, using Parzen-Cacoullos density
estimates, Van Ryzin [1966] studied the asymptotic optimality of
sample-based classification rules. For related results see Van Ryzin [1965] and
section 4.2 of Chapter IV. Van Ryzin [1969] gives conditions for the
pointwise 'almost sure' convergence of the fixed window estimates.
("Potential functions" in pattern recognition theory are synonymous
with the "kernels" of fixed window density estimation theory.)

An alternative nonparametric approach for estimating
multivariate densities has been proposed by Loftsgaarden and Quesenberry [1965].
Let

        $S_d(x) = \{y \in X : \|y - x\| \le d\}$

and denote the volume of this hypersphere by

        $A_{d,x} = \mu(S_d(x))$ .

Let $k_n$ be a non-decreasing sequence of positive integers such that
$k_n \to \infty$, but $k_n / n \to 0$ as $n \to \infty$. Let $d_{k_{n_i}}(x)$ be the distance from $x$
to the $k_{n_i}$-th closest point among the $n_i$ sampled individuals from the
density $f_i$ $(i = 1,2,\ldots,k)$. Then the Loftsgaarden and Quesenberry
estimate of $f_i(x)$ is

(3.2.5)     $\hat{f}_i(x) = \dfrac{k_{n_i} - 1}{n_i\, A_{d_{k_{n_i}}(x),\,x}}$ ,  $i = 1,2,\ldots,k$
In contrast to the fixed window estimates, the estimates given by
(3.2.5) are called "variable window" or "fixed view" estimates. Glick [1969]
proved that if $\mu$ is Lebesgue measure and $\hat{q}_i \hat{f}_i(x) = \dfrac{n_i}{n} \hat{f}_i(x)$,
where each $\hat{f}_i$, $1 \le i \le k$, is of the form (3.2.5), then the plug-in
estimator $\hat{r}$ is a consistent estimate of the optimum probability $r^*$.
(For proof see Theorem 4.2.)
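
A univariate sketch of the variable window estimate (3.2.5) follows. In one dimension the "hypersphere" $S_d(x)$ is the interval $[x - d, x + d]$, whose Lebesgue measure is $2d$. The function name `lq_estimate` and the data in the usage note are hypothetical, and `k` is supplied by the caller rather than taken from a sequence $k_n$.

```python
def lq_estimate(x, sample, k):
    """Variable window estimate (k - 1) / (n * A), where A is the length of the
    interval centred at x reaching to the k-th closest sample point.

    Raises ZeroDivisionError if at least k sample points coincide with x.
    """
    n = len(sample)
    d_k = sorted(abs(xj - x) for xj in sample)[k - 1]  # distance to k-th closest point
    volume = 2.0 * d_k                                  # Lebesgue measure of [x - d_k, x + d_k]
    return (k - 1) / (n * volume)
```

For the hypothetical sample `[0.0, 1.0, 2.0, 3.0, 4.0]` with `x = 2.0` and `k = 3`, the third closest point is at distance 1, so the estimate is (3 − 1) / (5 × 2) = 0.2.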

There are several other papers in the literature
dealing with various methods for estimating probability density functions
and their properties. For more detailed references in this connection,
see Cacoullos [1973], Glick [1972] and Patrick [1972]. In particular,
Patrick [1972] gives an excellent account of estimation by the potential
function method and by the stochastic approximation method, a method of
searching for a parameter vector which optimizes a prescribed criterion.

A final remark on nonparametric density estimation: for univariate
unimodal densities, B. L. S. P. Rao [1969] shows that the maximum
likelihood density estimate is "the slope of the concave majorant of the
empirical distribution" and that this estimate, too, is consistent
(converges pointwise in probability). (Maximum likelihood density
estimation can also be found in Wegman [1970a, 1970b]. In general, as
remarked by Wegman [1972], maximum likelihood density estimates do
not exist, but with some appropriate restriction on the class of
densities from which the density may be selected, a maximum likelihood
estimate over that class may exist.)

§3.3  Classification  Rules. 


3.3.1  Nearest  Neighbor  Rule. 

Throughout this section, the simple loss structure (see (2.2.1))
is assumed, and formulation (ii) of section 2.1, with the $q_i$'s, $1 \le i \le k$,
unknown, is considered.

In section 2.3.3, we considered the density plug-in estimator $\hat{r}$
of the optimum probability of correct classification $r^*$. Three
problems arise in this estimation:

(i)  such an estimate is almost always over-optimistic,

(ii)  one should always suspect the validity of an assumed parametric
model,

(iii)  in more general situations it is quite difficult to compute
these probabilities exactly, even if the probabilistic
structure is completely known.

In order to overcome some of these drawbacks, Glick [1969]
introduces the notion of a 'counting' estimate of the probability of correct
classification, and gives a classification rule related to this notion.
Let $D$ be any classification procedure, namely, an ordered partition
$\langle D_1, D_2, \ldots, D_k \rangle$ of the sample space $X$. Given a correctly classified
random sample of size $n$ from the mixed population T, the proportion
of sampled individuals who would be correctly classified by $D$ is the
most natural estimate of the rule's actual probability of correct
classification. This estimate is known as the counting or empiric estimate and
is denoted by $\tilde{r}(D)$. Thus,

        $\tilde{r}(D) = \dfrac{1}{n}$ (# of sampled individuals classified correctly by $D$)

              $= \dfrac{1}{n} \sum_{i=1}^{k}$ (# of $\pi_i$ sampled individuals classified correctly by $D$)

              $= \dfrac{1}{n} \sum_{i=1}^{k}$ (# of $\pi_i$ sampled individuals with $X \in D_i$)

              $= \dfrac{1}{n} \sum_{i=1}^{k} \int_{D_i} d(n_i \hat{F}_i(x))$

(3.3.1)       $= \sum_{i=1}^{k} \int_{D_i} \hat{q}_i \, d\hat{F}_i(x)$      (see (2.3.1)).


The counting function $\tilde{r}$ resembles the density plug-in
estimate $\hat{r}$, but the empiric approach differs in an important way from the
plug-in approach, viz., no restriction is placed on the distributions $F_i$
$(i = 1,2,\ldots,k)$, the densities $f_i = \dfrac{dF_i}{d\mu}$ are of no importance, and $\mu$
need not be specified. For these reasons the nearest neighbor rule, to
be described below, is termed a nonparametric classification procedure.
Also note that, unlike $\hat{r}$, the counting estimate $\tilde{r}$ is an unbiased
estimate of $r(D)$, for:

        $E(\tilde{r}(D)) = \dfrac{1}{n} \sum_{i=1}^{k} E$ (# of sampled $\pi_i$ individuals with $X \in D_i$)

                $= \dfrac{1}{n} \sum_{i=1}^{k} q_i\, n\, P(X \in D_i \mid X \text{ is drawn from } \pi_i)$

(3.3.2)         $= r(D)$      (see (2.2.12)).

Mimicking the optimality criterion, one would desire to
have a classification procedure that maximizes the counting function $\tilde{r}$.
Since $\tilde{r}(D) \le 1$ for any discriminant $D$, consequently $\sup_{D \in \mathcal{D}} \tilde{r}(D) \le 1$ a.s.
The equality $\sup_{D \in \mathcal{D}} \tilde{r}(D) = 1$ is attained by the so-called nearest
neighbor rule (NN rule) $\ddot{D}$, which assigns an unidentified individual from
the mixed population F to the category of a nearest correctly
classified sample observation.

Definition 3.2:  We call $x'_n \in \{x_1, x_2, \ldots, x_n\}$ a nearest neighbor to
$x$ if

        $\min_{1 \le i \le n} d(x_i, x) = d(x'_n, x)$

The distance, in general, may be other than the usual Euclidean distance.

The NN rules belong to a broad category of "good"
data-dependent rules, distinct from the plug-in rule $\hat{D}$ discussed in section
2.3.1. The first formulation of the NN rule and contributions to the
analysis of its properties were made by Fix and Hodges, as early as 1951.
Subsequently, these rules have been investigated by Cover and Hart
[1967] and Cover [1968]. Variations on this theme include the $v$-nearest
neighbor rule, which assigns an unidentified individual to the
subpopulation with a plurality among the $v$ nearest measurements. Cover and
Hart [1967] have shown that, among the class of all $v$-nearest neighbor rules,
the simple nearest neighbor rule is admissible. They prove the
convergence $r(\ddot{D}) \to r_{NN}$ with probability one, and for $k = 2$ classes, the
limit is bounded by (see Cover and Hart [1967])

        $1 \ge r^* \ge r_{NN} \ge 1 - 2r^*(1 - r^*) \ge \tfrac{1}{2}$

and $r_{NN} = r^*$ iff $r^* = \tfrac{1}{2}$ or $r^* = 1$, i.e. in the two extreme cases
of complete certainty and complete uncertainty, the nearest neighbor
actual probability of correct classification equals the optimum
probability. It is in these cases, or approximations to them, that the nearest
neighbor rule is most useful. Later, Cover [1968] studied the
rate of convergence of the Bayes risk of the nearest neighbor rule.
An excellent account of nearest neighbor rules is given in Patrick [1972].
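
The two-class bound above is the Cover and Hart result restated in terms of correct classification rather than error. As a brief sketch, write $R^* = 1 - r^*$ and $R_{NN} = 1 - r_{NN}$ for the corresponding error probabilities (notation introduced here only for this check):

```latex
\begin{align*}
R^* \le R_{NN} \le 2R^*(1-R^*)
  &\iff r^* \ge r_{NN} \ge 1 - 2r^*(1-r^*), \\
r^*(1-r^*) \le \tfrac14
  &\implies 1 - 2r^*(1-r^*) \ge \tfrac12, \\
1 - 2r^*(1-r^*) = r^*
  &\iff 2(r^*)^2 - 3r^* + 1 = 0
  \iff (2r^*-1)(r^*-1) = 0 .
\end{align*}
```

The upper and lower bounds on $r_{NN}$ therefore coincide exactly at $r^* = \tfrac{1}{2}$ and $r^* = 1$, which are the two extreme cases mentioned in the text.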

Finally, we must specify a means of resolving ties; for
example, the rule may be modified to choose the most popular category
among the tied neighbors, or to assign to the population with the lowest subscript.
Glick [1969] remarks that "A NN rule seems most reasonable and useful
when the probability of ties is zero". However, Cover and Hart [1967]
claim that their results are true even for those cases in which ties
occur with non-zero probability. This assertion, however, seems to
need some mathematical justification.

Remark 3.1:  If the probability of ties is zero, then with probability
one, the rule $\ddot{D}$ classifies correctly all $n$ sampled individuals, i.e.
$\tilde{r}(\ddot{D}) = 1$. Hence, Glick [1969] comments that "the counting estimate of
$r(\ddot{D})$, the simple NN rule's probability of correct classification is
grossly biased and unreasonable". For this reason, further methods of
estimating $r(\ddot{D})$ are suggested in the literature. One such method
is the deletion-counting method of estimation, which is not dealt with
here. The reader is referred to Glick [1969] for further details.

(For an example of the NN rule, see Appendix I.)
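
The effect described in Remark 3.1 is easy to reproduce numerically. The sketch below implements the simple NN rule for univariate data and compares the counting estimate with a leave-one-out variant. It is assumed here, for illustration only, that "deletion-counting" means removing the individual being classified from the sample before looking for its nearest neighbor; Glick [1969] should be consulted for the exact definition. All function names and the data in the usage note are hypothetical, and ties are broken in favour of the earliest sample point, an arbitrary choice.

```python
def nearest_label(x, points, labels, skip=None):
    """Label of the sample point nearest to x; index `skip` is ignored if given."""
    best_d, best_lab = None, None
    for idx, (p, lab) in enumerate(zip(points, labels)):
        if idx == skip:
            continue
        d = abs(p - x)
        if best_d is None or d < best_d:  # strict '<' keeps the earliest point on ties
            best_d, best_lab = d, lab
    return best_lab

def counting_estimate(points, labels):
    """Proportion of sampled individuals the NN rule classifies correctly."""
    hits = sum(nearest_label(p, points, labels) == lab
               for p, lab in zip(points, labels))
    return hits / len(points)

def deletion_counting_estimate(points, labels):
    """Same proportion, but each individual is left out when it is classified
    (an assumed reading of the deletion-counting method)."""
    hits = sum(nearest_label(p, points, labels, skip=i) == lab
               for i, (p, lab) in enumerate(zip(points, labels)))
    return hits / len(points)
```

With the hypothetical sample `points = [0.0, 1.0, 2.0, 2.4, 5.0, 6.0]` and `labels = [1, 1, 1, 2, 2, 2]`, the counting estimate is 1 because every point is its own nearest neighbor, while the leave-one-out version gives 4/6: the points 2.0 and 2.4 are each other's nearest neighbors but carry different labels.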
3.3.2  Minimum Distance Classification Rule.


Das Gupta [1964] suggested the so-called minimum distance
classification rule for the above nonparametric classification problem.
Let $X$ $(p \times 1)$ be a random vector from one of the populations $\pi_i$
$(i = 0,1,2,\ldots,k)$ with distribution functions $F_i$ $(i = 0,1,2,\ldots,k)$.
The $F_i$'s are completely unspecified except that $F_0 = F_i$ for exactly
one value of $i$ $(i = 1,2,\ldots,k)$ and the $F_i$'s $(i = 1,2,\ldots,k)$ are all
distinct. Let $\mathcal{D}$ denote the decision space $(d_1,\ldots,d_k)$ where $d_i$
denotes the decision $F_0 = F_i$, $i = 1,2,\ldots,k$. Let $X$ be a vector of
sample observations. Then a classification rule $\phi = (\phi_1,\ldots,\phi_k)$ is
a $k$-dimensional vector-valued measurable function of $X$ such that

(3.3.3)     $0 \le \phi_i(X) \le 1$ ,   $\sum_{i=1}^{k} \phi_i(X) = 1$   $\forall\, X \in \mathcal{X}$

and $\phi_i(X)$ denotes the probability of taking the decision $d_i$ on
observing $X = x$.

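Definition 3.3: The minimum distance rule, $\phi^{(d)}$, based on a p-variate
distance function $d$ (arbitrary distance), is defined by

(3.3.4)  $\phi_i^{(d)} = \begin{cases} 1 & \text{if } d_{0i} = \min_{1 \le j \le k} d_{0j} \\ 0 & \text{otherwise} \end{cases}$

for $i = 1,2,\ldots,k$, where $d_{0i} = d(\hat F_0, \hat F_i)$, $1 \le i \le k$, $\hat F_0$ being the
empirical distribution function of the $n_0$ individuals sampled from $\pi_0$
(and $\hat F_i$ similarly for the $n_i$ individuals sampled from $\pi_i$).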
Definition 3.4: A distance function $d$ between two p-variate distribution
functions is said to be consistent if, given any $\epsilon > 0$, $\epsilon' > 0$,
there exists a number $N$ such that for $n > N$

(3.3.5)  $P[d(\hat F_n, F) > \epsilon \mid F] < \epsilon'$

where $\hat F_n$ is the sample distribution function, based on a random sample
of size $n$ from a p-variate population with distribution function $F$.

If (3.3.5) holds uniformly for all $F \in B$, a subclass of all
p-variate distribution functions, then $d$ is said to be uniformly
consistent (B).

Definition 3.5: A distance function $d$ is called the Kolmogorov
distance when

(3.3.6)  $d(F,G) = \sup_{-\infty < x < \infty} |F(x) - G(x)|$

Following the definition of $\phi^{(d)}$ in (3.3.4), let

(3.3.7)  $r_{ii}(d) = P[\phi_i^{(d)}(X) = 1 \mid F_0 = F_i]$,  $i = 1,2,\ldots,k$

and let

(3.3.8)  $f_d(n,\gamma,F) = P[d(\hat F_n, F) < \gamma \mid F]$

With respect to the consistency (uniform) notion of a distance
function $d$, Das Gupta [1964] has proved that the minimum distance
classification rule $\phi^{(d)}$ defined by (3.3.4) is consistent (uniform),
i.e., $r_{ii}(d) \to 1$ as $n_i \to \infty$, $i = 1,2,\ldots,k$, if the distance function
$d$ is consistent (uniform). He further extends the result to the case
when $d$ is the Kolmogorov distance defined by (3.3.6). Das Gupta [1964]
obtained a lower bound for the probability of correct classification for
such rules given by:


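(3.3.9)  $r_{ii}(d) \ge f_d(n_1, \tfrac{\beta}{4}, F_1)\, f_d(n_2, \tfrac{\beta}{4}, F_2)\, f_d(n_0, \tfrac{\beta}{4}, F_0 = F_i)$  ($i = 1,2$)

where $d(F_1,F_2) \ge \beta > 0$, and when $d$ is the Kolmogorov distance

(3.3.10)  $r_{ii}(d) \ge \prod_{j=0}^{2} \left(1 - e^{-n_j \beta_{12}^2/32}\right)$,  $i = 1,2$,

where $\beta_{12} = d(F_1,F_2)$. (For proofs of these assertions, see section 4.3
of Chapter IV.)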
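
As an illustration of Definition 3.3 with the Kolmogorov distance (3.3.6), the following is a minimal univariate sketch. The function names, the normal populations and the tie-breaking by lowest subscript are assumptions made for the example, not part of Das Gupta's paper.

```python
import bisect
import random

def ecdf(sample):
    """Empirical distribution function of a univariate sample."""
    s = sorted(sample)
    n = len(s)
    return lambda t: bisect.bisect_right(s, t) / n

def kolmogorov_distance(a, b):
    """sup_x |F_a(x) - F_b(x)| for two empirical distribution functions.
    Both are right-continuous step functions, so the supremum is attained
    at one of the pooled sample points."""
    Fa, Fb = ecdf(a), ecdf(b)
    return max(abs(Fa(t) - Fb(t)) for t in list(a) + list(b))

def minimum_distance_rule(x0, training):
    """training maps i -> sample from pi_i; returns the i minimizing d_0i
    (ties go to the lowest subscript) together with all distances."""
    dists = {i: kolmogorov_distance(x0, s) for i, s in training.items()}
    decision = min(sorted(dists), key=lambda i: dists[i])
    return decision, dists

# Assumed illustration: pi_1 = N(0,1), pi_2 = N(2,1), and pi_0 is really pi_1.
rng = random.Random(1)
training = {1: [rng.gauss(0, 1) for _ in range(200)],
            2: [rng.gauss(2, 1) for _ in range(200)]}
x0 = [rng.gauss(0, 1) for _ in range(200)]
decision, dists = minimum_distance_rule(x0, training)
print(decision, dists)
```

Here the whole sample from $\pi_0$ is allocated at once, which is the setting in which the consistency statements above are formulated.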


3.3.3  Classification  Rules  Based  on  Ranks. 

The  idea  of  using  the  rank-statistics  for  devising  classifica¬ 
tion  procedures  was  suggested  by  Das  Gupta  [1964].  He  proposed  the  fol¬ 
lowing  rule  based  on  the  Wilcoxon-Statistic  for  the  classification  of 
an  individual  into  one  of  two  univariate  populations. 

As in section 3.3.2, let $X$ be a random vector from a population
$\pi_0$ with distribution function $F_0$, which is one of the two
populations $\pi_i$ ($i = 1,2$) with continuous distribution functions $F_i$
($i = 1,2$) respectively. The properties of the Wilcoxon statistic for
the discrete case have not been fully investigated so far. Let
$(x_1,x_2,\ldots,x_{n_0})$, $(y_1,y_2,\ldots,y_{n_1})$, $(z_1,z_2,\ldots,z_{n_2})$ be random samples
of sizes $n_0$, $n_1$, $n_2$ from populations $\pi_0$, $\pi_1$, $\pi_2$ respectively.

Define

$u = \frac{1}{n_0 n_1} \times$ (# of pairs $(x_i,y_j)$ with $x_i < y_j$),  ($i = 1,2,\ldots,n_0$; $j = 1,2,\ldots,n_1$)

$v = \frac{1}{n_0 n_2} \times$ (# of pairs $(x_i,z_k)$ with $x_i < z_k$),  ($i = 1,2,\ldots,n_0$; $k = 1,2,\ldots,n_2$).

The proposed classification rule, based on these statistics
$u$ and $v$, is defined by: Decide

(3.3.11)  $F_0 = F_1$ if $|u - \tfrac{1}{2}| < |v - \tfrac{1}{2}|$;

decide $F_0 = F_2$ otherwise. (3.3.11) is equivalent to: Decide
$F_0 = F_1$ if $(u-v)(u+v-1) < 0$.

Das Gupta [1964] proved, in his paper, that the above classification
procedure based on the Wilcoxon statistic is consistent. Kanazawa
[1974] proposes the extension of the rule for the multivariate and
multisample case, showing its consistency. When the observations are
correctly classified, he has shown that his classification statistic is
asymptotically distributed according to the chi-square distribution with
p (number of variates) degrees of freedom. For details see Kanazawa
[1974].
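
The univariate two-population rule based on the pair counts $u$ and $v$ (decide $F_0 = F_1$ when $u$ is closer to $\frac{1}{2}$ than $v$ is) can be sketched as follows. The function names and the small data set are hypothetical and serve only to show the computation.

```python
def pair_fraction(xs, ys):
    """Proportion of pairs (x, y) with x < y."""
    return sum(x < y for x in xs for y in ys) / (len(xs) * len(ys))

def wilcoxon_rule(x0, y1, z2):
    """Return (decision, u, v); decision 1 means F_0 = F_1, decision 2 means F_0 = F_2."""
    u = pair_fraction(x0, y1)
    v = pair_fraction(x0, z2)
    decision = 1 if abs(u - 0.5) < abs(v - 0.5) else 2
    return decision, u, v

# Hypothetical samples: x0 is interleaved with y1 and lies entirely below z2.
x0 = [0.1, 0.4, 0.7]
y1 = [0.0, 0.3, 0.5, 0.9]
z2 = [2.0, 3.0, 4.0]
result = wilcoxon_rule(x0, y1, z2)
print(result)  # (1, 0.5, 1.0)
```

For these samples $u = 6/12$ and $v = 9/9$, so $(u-v)(u+v-1) = -0.25 < 0$ and both forms of the rule give the same decision.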

Kinderman [1972] proposed a class of rules based on linear rank
statistics as follows: Suppose $n$ observations are available from each
of the three populations $\pi_0, \pi_1, \pi_2$. Let $N = 3n$. Define

$T_{nj} = n^{-1} \sum_{i=1}^{N} E_{Ni} L_{ji}$,  $j = 0,1,2$,

where $E_{Ni}$ is a sequence of scores and

$L_{ji} = \begin{cases} 1 & \text{if the } i\text{th ordered observation in the pooled sample is from } \pi_j \\ 0 & \text{otherwise.} \end{cases}$

Kinderman's rule classifies the observations from $\pi_0$ into $\pi_1$ if
and only if

$2T_{n0} \ldots$

He assumed that the distribution in $\pi_2$ differs from that in $\pi_1$ by a
positive shift in translation. He computed the relative asymptotic
efficiency of this rule to the rule obtained by replacing $T_{nj}$ by the
corresponding sample mean of the observations from $\pi_j$ and specialized
his results to "Wilcoxon rank-sum" scores and "normal" scores.
Govindarajulu and Gupta [1972] consider similar linear rank statistics for
the several population case when the sample sizes may be different. For
lack of space, the details of these papers are omitted. Interested
readers are referred to Kinderman [1972] and Govindarajulu and Gupta
[1972].
3.3.4 Best-count Rules:

We discussed in section 3.3.1 the "nearest neighbor" rules
which maximize $\tilde r$, the counting estimator of the probability of correct
classification, over the domain $\mathcal{D}^*$ of all discriminants. These are
not the only interesting ones related to the counting function $\tilde r$. In
this section, we shall discuss another such rule, known as the
"Best-count" rule: a rule which optimizes certain specified criteria in a
given class. A systematic study of this concept is due to Glick [1969].
Best-count discriminants generalize sample-based "best" linear or
quadratic discriminants.

Consider the set-up as in formulation (ii) of section 2.1.
Let $\mathcal{D} \subset \mathcal{D}^*$, the collection of all discriminants, be an arbitrary but
completely specified collection of discriminants $D$. Then

(3.3.12)  $r^{\mathcal{D}} = \sup_{D \in \mathcal{D}} r(D)$,

is called the restricted-optimum probability of correct classification.

Definition 3.6: A classification rule $D \in \mathcal{D}$ is said to be $\mathcal{D}$-optimal
(or restricted optimal for the collection $\mathcal{D}$) if

(3.3.13)  $r(D) = r^{\mathcal{D}}$

(In general, there need not exist such a restricted optimum
rule.)

Remark 3.2:  (i)  $r^{\mathcal{D}} = \sup_{D \in \mathcal{D}} r(D) \le \sup_{D \in \mathcal{D}^*} r(D) = r^*$.

(ii)  If, among the classification rules which are optimal in the
unrestricted sense, there exists one, $D^*$, which is a member of $\mathcal{D}$,
then

$r^* = r(D^*) = r^{\mathcal{D}}$.

A sample-based rule $\tilde D \in \mathcal{D}$ is called a minimum-misclassification
discriminant or best-count discriminant if it maximizes $\tilde r$
(defined by (3.3.1)) over all $D \in \mathcal{D}$, i.e. a best-count discriminant
$\tilde D \in \mathcal{D}$ satisfies

(3.3.14)  $\tilde r(\tilde D) = \sup_{D \in \mathcal{D}} \tilde r(D) = \tilde r^{\mathcal{D}}$,

and $\tilde D$ is called a best-count rule for the collection $\mathcal{D}$.

Since empirical distributions are simple functions, there
necessarily exists a sample-based rule (not usually unique) which
maximizes the function $\tilde r$ over all the rules $D \in \mathcal{D} \subset \mathcal{D}^*$. It can be noted
from the above definition that the nearest neighbor rule $\ddot D$, discussed
in section 3.3.1, is a best-count discriminant for the collection $\mathcal{D}^*$
of all discriminants. It was seen in section 3.3.1 that for any
discriminant $D$, $\tilde r(D)$ is an unbiased estimate of $r(D)$. Using this
unbiasedness for a fixed $D$, Glick [1975] has proved that
$E(\tilde r^{\mathcal{D}}) = E(\tilde r(\tilde D)) \ge r^{\mathcal{D}} \ge r(\tilde D)$. He has also proved:

(i)  The counting function $\tilde r$ converges to the actual probability of
correct classification uniformly over $D \in \mathcal{D}$, i.e.

$\sup_{D \in \mathcal{D}} |\tilde r(D) - r(D)| \xrightarrow{\text{a.s.}} 0$ as $n \to \infty$,

provided the $F_i$'s are absolutely continuous with respect to the
Lebesgue measure.

(ii)  The best-count discriminant (or the sequence of such discriminants
as sample size $n \to \infty$) is Bayes risk strongly consistent.

He further extends these results to prove that $r(\tilde D) \to r^*$, the
unrestricted optimum probability, in the case of classification into normal
densities with estimated mean vectors and common covariance matrix. (For
proofs of these assertions on best-count discriminants see section 4.4
of Chapter IV.)

As a final remark on these best-count discriminants it should
be mentioned that the construction of the Fisher-Anderson linear
discriminant was explicit in its definition, which is not the case with the
definition of best-count discriminants. For an arbitrary rule collection,
even with k = 2, there seems to be no general method for constructing
best-count rules (other than by exhaustive trial and error). Glick [1975]
remarks that "no general construction of a best-count linear
discriminant is yet known when the sample observations from the two populations
can not be separated by a hyperplane".
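
For a very small rule collection the exhaustive search mentioned above is feasible. The sketch below assumes a hypothetical collection of univariate threshold rules ("classify into one population if $x \le c$, into the other otherwise") and a mixture sample with known labels; it is meant only to illustrate the definition of a best-count rule, not a general construction.

```python
def counting_estimate(rule, xs, labels):
    """Proportion of sampled individuals classified correctly by `rule`."""
    return sum(rule(x) == y for x, y in zip(xs, labels)) / len(xs)

def best_count_threshold(xs, labels):
    """Exhaustive search over the rules 'low if x <= c else high', where c
    ranges over -infinity and the observed values, in both orientations.
    Returns (best counting estimate, threshold c, (low, high))."""
    candidates = [float("-inf")] + sorted(set(xs))
    best = (-1.0, None, None)
    for c in candidates:
        for low, high in ((1, 2), (2, 1)):
            rule = lambda x, c=c, low=low, high=high: low if x <= c else high
            r = counting_estimate(rule, xs, labels)
            if r > best[0]:          # keep the first maximizer found
                best = (r, c, (low, high))
    return best

# Hypothetical sample in which the two populations are not separable.
xs = [0.2, 0.5, 0.9, 1.4, 1.6, 2.3]
labels = [1, 1, 2, 1, 2, 2]
best_r, best_c, orientation = best_count_threshold(xs, labels)
print(best_r, best_c, orientation)
```

The maximizer is not unique here (thresholds 0.5 and 1.4 both misclassify one individual), which agrees with the remark that a best-count rule is usually not unique.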

3.3.5  Rules  Based  on  Tolerance  Regions. 

The idea of using tolerance regions for the classification
problem was first suggested by Anderson [1966]. For the univariate case,
he considers some variations of NN rules, and in the multivariate case,
vector observations may be "ranked" (using them to define blocks) and
then a univariate method can be applied. Another method suggested by
Anderson [1966] is: Use the pooled training sample to construct
"blocks". An observation $X$ is classified into $\pi_i$ if the block to which
$X$ belongs is defined by the majority of observations from $\pi_i$. For example,
in the two population case, construct two sets of blocks separately
based on the observations from $\pi_1$ and $\pi_2$. Let $B_1$ and $B_2$ be the
blocks in the two sets which contain $X$. Consider the number of
observations from $\pi_2$ in $B_1$ and the number of observations from $\pi_1$ in
$B_2$, and classify $X$ according to the larger number. The notion of
tolerance region is quite important because the expected probability in
the region is equal to the number of samples (=k) divided by k+1.
Different methods have been suggested for the construction of tolerance
regions. For some details see Patrick [1972].

Quesenberry and Gessaman [1968] also suggested the use of
tolerance regions for the k-population nonparametric classification
problem with $2^k - 1$ decisions (instead of k decisions) by introducing
the idea of reserve judgment. For details interested readers are
referred to their paper.

Remark 3.3: When a statement regarding the probability of a certain
statistical decision rule remains valid for every member in a given
family of distributions, it is termed a "Distribution-free" rule with
respect to that family. However, in contrast to the problems of
hypothesis testing or estimation, nonparametric classification techniques are
not really distribution-free. This is because, regardless of the name
(parametric, distribution-free or nonparametric), the resulting
discriminant function is defined by a set of parameters which must be determined
from the existing prior information. Consequently, we could say that
all techniques are somewhat parametric in nature (Andrews [1972], p.
104).


CHAPTER  IV 


Mathematical  Proofs 

In  Chapters  II  and  III,  we  studied  various  major  classifica¬ 
tion  rules  (parametric  and  nonparametric) ,  discussed  in  the  literature. 
The  study  also  included  the  sample-based  classification  rules,  the 
estimates  of  probability  of  correct  classifications,  and  mathematical 
assertions  on  bias,  consistency  and  asymptotic  optimality  of  these  rules. 
In  this  chapter,  we  give  mathematical  proofs  of  some  of  these  assertions. 
Let  us  recapitulate  the  different  notations  that  have  been  used: 

(i)  $r(D)$ - the actual probability of correct classification for
any arbitrary classification rule $D \in \mathcal{D}^*$, the collection of
all classification rules (defined by (2.2.12)).

(ii)  $r^* = \sup_{D \in \mathcal{D}^*} r(D)$, the optimal probability of correct
classification (defined by (2.2.13)).

(iii)  $\hat r$ - the density plug-in estimate of the optimum probability
of correct classification, $r^*$ (see (2.3.13)).

(iv)  for an arbitrary but fixed subcollection $\mathcal{D}$ of $\mathcal{D}^*$,
$r^{\mathcal{D}} = \sup_{D \in \mathcal{D}} r(D)$ defines the restricted optimum probability
(see (3.3.12)).

(v)  $\tilde r(D)$ - the counting estimate of the probability of correct
classification (see (3.3.1)).
§4.1 Asymptotic Optimality of Density Plug-In Estimators $\hat r$.

Theorem 4.1 (Bias): If the estimates $\hat q_i \hat f_i$, $1 \le i \le k$, are pointwise
unbiased, or more generally if they satisfy:

(4.1.1)  $E(\hat q_i \hat f_i(x)) \ge q_i f_i(x)$  for $1 \le i \le k$

and for almost all $x \in \mathcal{X}$, then

(4.1.2)  $E(\hat r(\hat D)) \ge r^* \ge r(\hat D)$

Proof: The second inequality follows since by definition

$r^* = \sup_{D \in \mathcal{D}^*} r(D) \ge r(\hat D)$.

Further, using $\hat g_j(x) = \hat q_j \hat f_j(x)$, $1 \le j \le k$, the convexity of
$\max_{1 \le j \le k}(\cdot)$ and the assumption (4.1.1),

$E(\hat\gamma(x)) = E(\max_{1 \le j \le k} \hat g_j(x))$

$\ge \max_{1 \le j \le k} E(\hat g_j(x))$

$\ge \max_{1 \le j \le k} (q_j f_j(x))$

$= \gamma(x)$.

Invoking Fubini's iterated integrals theorem,

$E(\hat r(\hat D)) = E\left(\int \hat\gamma(x)\right)$  (see (2.3.13))

$= \int E(\hat\gamma(x))$

$\ge \int \gamma(x)$

$= r^*$.

(Integration with respect to $\mu$, a $\sigma$-finite measure, is abbreviated here
and often hereafter.)

q.e.d.


Remark 4.1: The usual parametric estimates of multivariate normal
densities do not satisfy the conditions of Theorem 4.1. The following is an
example (Glick [1972]) satisfying the conditions of Theorem 4.1.

Example 4.1: Consider the counting measure $\mu$ on a discrete sample
space $\mathcal{X} = \{x_1, x_2, \ldots\}$ (if $\mathcal{X}$ is finite then the distributions are
multinomial). Let $n_{ik}$ be the number of individuals from $\pi_i$ and
having $X = x_k$; then $n_{ik}$ is a binomial random variable with
expectation $n\, q_i f_i(x_k)$. The usual nonparametric density estimates of $f_i$,
$1 \le i \le k$, are given by

$\hat f_i(x_k) = \frac{n_{ik}}{n_i}$,  $x_k \in \mathcal{X}$  (see (2.3.1)).

Hence

$E(\hat q_i \hat f_i(x_k)) = E\left(\frac{n_i}{n} \cdot \frac{n_{ik}}{n_i}\right) = E\left(\frac{n_{ik}}{n}\right) = q_i f_i(x_k)$

and (4.1.1) holds.
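
The optimistic bias asserted by Theorem 4.1 can be checked by simulation in the multinomial setting of Example 4.1. The sketch below assumes two populations on a three-point sample space with arbitrarily chosen probabilities, mixture sampling of size n, and the plug-in estimate computed as $\hat r = \sum_k \max_i n_{ik}/n$; all names and numerical values are illustrative assumptions.

```python
import random

# Assumed setting: k = 2 populations on the sample space {x_1, x_2, x_3}.
q = [0.5, 0.5]                       # prior probabilities q_i
f = [[0.5, 0.3, 0.2],                # f_1(x_k)
     [0.2, 0.3, 0.5]]                # f_2(x_k)

# Optimum probability: r* = sum_k max_i q_i f_i(x_k) = 0.25 + 0.15 + 0.25.
r_star = sum(max(q[i] * f[i][k] for i in range(2)) for k in range(3))

def plug_in_estimate(n, rng):
    """Draw a mixture sample of size n and return sum_k max_i n_ik / n."""
    counts = [[0] * 3 for _ in range(2)]
    for _ in range(n):
        i = 0 if rng.random() < q[0] else 1
        k = rng.choices(range(3), weights=f[i])[0]
        counts[i][k] += 1
    return sum(max(counts[0][k], counts[1][k]) for k in range(3)) / n

rng = random.Random(0)
estimates = [plug_in_estimate(20, rng) for _ in range(2000)]
mean_r_hat = sum(estimates) / len(estimates)
print(r_star, mean_r_hat)            # the Monte Carlo mean lies above r*
```

The excess of the simulated mean over $r^*$ comes from the step $E(\max_j \hat g_j) \ge \max_j E(\hat g_j)$ in the proof, and it is largest at sample points where the $q_i f_i(x_k)$ are equal or nearly equal.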



The following theorem gives one of the valuable features of
the plug-in estimator, $\hat r$:

Theorem 4.2 (uniform consistency): If the density estimators $\hat f_i$,
$1 \le i \le k$, are themselves probability densities with respect to a
$\sigma$-finite measure $\mu$, which converge pointwise with probability one, i.e.
if

(4.1.3)  $\hat f_i(x) \xrightarrow{\text{a.s.}} f_i(x)$  and  $\int_{\mathcal{X}} \hat f_i(x)\, d\mu(x) \xrightarrow{\text{a.s.}} 1$,

then

(4.1.4)  $\sup_{D \in \mathcal{D}^*} |\hat r(D) - r(D)| \xrightarrow{\text{a.s.}} 0$

Proof: Let $D \in \mathcal{D}^*$ be any classification procedure. Then

$|\hat r(D) - r(D)| = \left| \sum_{i=1}^{k} \int_{D_i} \hat g_i - \sum_{i=1}^{k} \int_{D_i} g_i \right|$

$\le \sum_{i=1}^{k} \int_{D_i} |\hat g_i - g_i|$

$\le \sum_{i=1}^{k} \int_{\mathcal{X}} |\hat g_i - g_i|$.

This last bound does not depend on the rule $D$, and hence it also
bounds $\sup_{D \in \mathcal{D}^*} |\hat r(D) - r(D)|$. Consequently,

(4.1.5)  $\sup_{D \in \mathcal{D}^*} |\hat r(D) - r(D)| \le \sum_{i=1}^{k} \int_{\mathcal{X}} |\hat g_i - g_i|$.

It therefore suffices to show that the integrals

$\int_{\mathcal{X}} |\hat g_i - g_i| \xrightarrow{\text{a.s.}} 0$,  $1 \le i \le k$.

By (2.3.2), $\hat q_i \xrightarrow{\text{a.s.}} q_i$ for $1 \le i \le k$, and by hypothesis $\hat f_i \xrightarrow{\text{a.s.}} f_i$
for $1 \le i \le k$. These imply the pointwise convergences

$\hat g_i(x) = \hat q_i \hat f_i(x) \xrightarrow{\text{a.s.}} q_i f_i(x) = g_i(x)$  (proof trivial)

and

$\hat f(x) \xrightarrow{\text{a.s.}} f(x)$  (proof trivial)

where $\hat f(x) = \sum_{i=1}^{k} \hat q_i \hat f_i(x)$ estimates the mixed density
$f(x) = \sum_{i=1}^{k} q_i f_i(x)$. Since $0 \le \hat g_i(x) \le \sum_{i=1}^{k} \hat q_i \hat f_i(x) = \hat f(x)$ and
$\int \hat f(x)\, d\mu(x) \xrightarrow{\text{a.s.}} 1 = \int f(x)\, d\mu(x)$, the desired convergence
$\int |\hat g_i - g_i| \xrightarrow{\text{a.s.}} 0$ follows from the Lebesgue dominated convergence theorem.

If further $\int \hat f(x) \le 1$, then

$\int |\hat g_i - g_i| \le \int \hat f(x) + \int f(x) \le 2$

and $\sup_{D \in \mathcal{D}^*} |\hat r(D) - r(D)| \le 2k$ (from (4.1.5)).

For an a.s. uniformly bounded sequence of random variables,
convergence in probability implies convergence in quadratic mean.

q.e.d.


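Example 4.2: The following example shows that the condition
(4.1.3) is vital to Theorem 4.2.

Suppose $\mathcal{X} = (0,1)$ and $\mu$ is the Lebesgue measure. Let
$q_1 = q_2 = \frac{1}{2}$, $f_1 = f_2 = f$, $\hat f_1 = \hat f_2 = \hat f$, where

$\hat f(x) = \begin{cases} f(x) & \text{if } x \ge \frac{1}{n} \\ f(x) + n & \text{if } x < \frac{1}{n} \end{cases}$

and $\hat\gamma = \frac{1}{2} \hat f$. Then $\hat\gamma(x) \xrightarrow{\text{a.s.}} \gamma(x)$ and $\hat f(x) \xrightarrow{\text{a.s.}} f(x)$ at all $x \in \mathcal{X}$,
but $\int \hat f\, d\mu = 2$ for every $n$, so the second part of (4.1.3) fails.
Using (2.2.14) and (2.3.13), we have $r^* = \frac{1}{2}$ but
$\hat r = \frac{1}{2} + \frac{1}{2}\, n \left(\frac{1}{n}\right) = 1$, for $n = 1,2,3,\ldots$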

Remark 4.2: Theorem 4.2 states general conditions under which $\hat r$ is a
consistent estimator of $r^*$. Theorems 4.1 and 4.2 together suggest that
$\hat r$ is more appropriate as an estimate of $r^*$ than as an estimate of
$r(\hat D)$. Glick [1973] obtains similar results for sample-based
multinomial classification.


Theorem 4.3: Let $\mu$ be the Lebesgue measure. If $\hat g_i(x) = \frac{n_i}{n} \hat f_i(x)$,
where each $\hat f_i$, $1 \le i \le k$, is a Loftsgaarden and Quesenberry
density estimate defined by (3.2.5), then the corresponding plug-in
estimate $\hat r$ satisfies

(4.1.6)  $\hat r \xrightarrow{p} r^*$

Proof: Theorem 3.1 of Loftsgaarden and Quesenberry [1965] asserts that,
for $1 \le i \le k$,

$\hat f_i(x) \xrightarrow{p} f_i(x)$  at each $x \in \mathcal{X}$.

Consequently, the assertion of the theorem follows from Theorem 4.2.

q.e.d.

The  following  is  a  consistency  theorem  for  general  parametric 
densities  considered  in  section  3.2. 

Theorem 4.4: If each $f_i(x;\lambda)$ is a continuous function of the unknown
parameter $\lambda$, for $1 \le i \le k$ and for all $x \in \mathcal{X}$, and if

(4.1.7)  $\hat\lambda \xrightarrow[p]{\text{a.s.}} \lambda^*$ (true value of $\lambda$), then

(4.1.8)  $\hat r \xrightarrow[p]{\text{a.s.}} r^*$  and  $\hat r \xrightarrow{\text{q.m.}} r^*$

Proof: Continuity of $f_i(x;\lambda)$ and (4.1.7) imply, for $1 \le i \le k$,

$\hat f_i(x) = f_i(x;\hat\lambda) \xrightarrow[p]{\text{a.s.}} f_i(x;\lambda^*) = f_i(x)$

and

$\int \hat f_i(x)\, d\mu(x) = \int f_i(x;\hat\lambda)\, d\mu = 1$ identically.

Thus, the conclusions follow from Theorem 4.2, since the
Lebesgue measure is a $\sigma$-finite measure.

q.e.d.


Corollary 4.1: Suppose k = 2, and the distributions $F_1$ and $F_2$ are
multivariate normal with common covariance matrix $\Sigma$. If $\hat f_1$ and $\hat f_2$
are the appropriate multivariate normal density estimators, then $\hat r$
satisfies (4.1.8).

Proof: The strong consistency of the parameter estimates follows from the strong
law of large numbers, and the conclusions follow from a simple and direct
application of Theorem 4.4.

q.e.d.


§4.2  Asymptotic  Optimality  of  Sample-Based  Classification  Rules. 

Once  a  sample-based  procedure  is  defined,  one  question  that 
arises  is,  in  what  sense  is  the  rule  asymptotically  optimal.  Several 
modes  of  asymptotic  optimality  for  classification  rules  have  been  proposed 
in  the  literature.  The  following  mathematical  proofs  of  asymptotic  opti¬ 
mality  of  parametric  and  nonparametric  classification  rules  have  been 
adapted  from  Van  Ryzin  [1966].  We  consider  the  two  category  classifica¬ 
tion  problem. 

Let $q$ and $1-q$ be the prior probabilities associated with the
two populations $\pi_1$ and $\pi_2$ respectively. Then an optimal Bayes rule,
$D^*$, with respect to these prior probabilities, is given by (see section
2.2.1)

$D_1^* = \{x \in \mathcal{X} : q f_1(x) \ge (1-q) f_2(x)\}$

(4.2.1)  $D_2^* = \mathcal{X} - D_1^*$;

ties to be resolved in some manner, as discussed in Chapters II and III.

If $q$, $f_1$ and $f_2$ are known, the classification problem is
solved by (4.2.1). When $f_1$ and $f_2$ are unknown, given random samples
of size $n_i$ from $\pi_i$, we seek estimates $\hat f_i$ for $f_i$ ($i = 1,2$).


Assume that these samples are independent of the observation $X$ to be
classified. Let $\{g_k(x,y) ;\ k = 1,2,\ldots\}$ be a sequence of real-valued
measurable functions defined on $\mathcal{X} \times \mathcal{X}$ such that a.e. $\mu$

(4.2.2)  $\int |g_k(x,y)|\, f_i(y)\, d\mu(y) < \infty$  for $i = 1,2$ ; $k = 1,2,3,\ldots$ .

Then form the estimates

(4.2.3)  $\hat f_i(x) = \frac{1}{n_i} \sum_{k=1}^{n_i} g_{n_i}(x, x_k^{(i)})$,  $i = 1,2$,

where $x_k^{(i)}$ denotes the $k$th observation in the sample from $\pi_i$.
Assuming these estimates are good in some sense, a reasonable procedure
to use in place of (4.2.1) is the plug-in rule, $\hat D$, given by

$\hat D_1 = \{x \in \mathcal{X} : q \hat f_1(x) \ge (1-q) \hat f_2(x)\}$

(4.2.4)  $\hat D_2 = \mathcal{X} - \hat D_1$.


Lemma 4.1: The difference in the Bayes risks, $R(\hat D) - \rho(D^*)$, satisfies
the following inequality:

(4.2.5)  $0 \le R(\hat D) - \rho(D^*) \le C_{12}\, q \int |\hat f_1(x) - f_1(x)|\, d\mu(x) + C_{21}(1-q) \int |\hat f_2(x) - f_2(x)|\, d\mu(x)$

Proof: The first inequality follows by the optimality of the Bayes
rule, $D^*$. The second inequality follows from the expressions for
$R(\hat D)$ and $\rho(D^*)$ given by

$\rho(D^*) = \int_{D_2^*} q\, C_{12} f_1(x) + \int_{D_1^*} (1-q)\, C_{21} f_2(x)$,

$R(\hat D) = \int_{\hat D_2} q\, C_{12} f_1(x) + \int_{\hat D_1} (1-q)\, C_{21} f_2(x)$,

where $D_i^*$ ($i = 1,2$) and $\hat D_i$ ($i = 1,2$) are given by (4.2.1) and (4.2.4).

q.e.d.


Remark 4.3: From Markov's inequality (Loeve [1963] p. 158), the
inequality (4.2.5) and Fubini's theorem we have

(4.2.6)  $P[R(\hat D) - \rho(D^*) \ge \epsilon] \le \epsilon^{-1} \left\{ C_{12}\, q \int E|\hat f_1(x) - f_1(x)|\, d\mu(x) + (1-q)\, C_{21} \int E|\hat f_2(x) - f_2(x)|\, d\mu(x) \right\}$

Consequently, it follows that examining Bayes risk consistency
(definition 3.1) of rules amounts to studying the asymptotic behaviour of

$\int E|\hat f_i(x) - f_i(x)|\, d\mu(x)$ as $n_i \to \infty$, $i = 1,2$.


In the following theorem, let

(4.2.7)  $f_i(x) = \sum_{j=1}^{s} a_{ij} \psi_j(x)$,  $i = 1,2$

for some finite $s$, where the $\psi_j(x)$ are $\mu$-integrable orthonormal
functions in $L_2(\mu)$.

Under (4.2.7) we are assuming a parametric form for $f_i(x)$
($i = 1,2$), but $s$ is assumed to be so large that estimation of the $a_{ij}$'s
becomes impractical. Aizerman, Braverman and Rozonoer [1964] use the
estimates $\hat f_i(x)$ given by (4.2.3) where

(4.2.8)  $g_k(x,y) = g(x,y) = \sum_{j=1}^{s} \psi_j(x) \psi_j(y)$,  $k = 1,2,3,\ldots$

These estimates are unbiased, for:

(4.2.9)  $E(\hat f_i(x)) = \int g(x,y) f_i(y)\, d\mu(y)$

$= \sum_{j=1}^{s} \psi_j(x) \int \psi_j(y) f_i(y)\, d\mu(y)$

$= \sum_{j=1}^{s} \psi_j(x)\, a_{ij} = f_i(x)$

(using orthonormality of the $\psi_j$'s and (4.2.7)).


Theorem 4.5: Under (4.2.7), let $\hat D$ be defined by (4.2.3), (4.2.4) and
(4.2.8). Then $\hat D$ is BRC with $D^*$.

Proof: Since $\int |g(x,y)|\, f_i(y)\, d\mu(y) < \infty$, by (4.2.9), the strong law
of large numbers and the $L_r$-convergence theorem (Loeve [1963] p. 163), we
have

$E|\hat f_i(x) - f_i(x)| \to 0$ as $n_i \to \infty$.

Further,

$E|\hat f_i(x) - f_i(x)| \le E|\hat f_i(x)| + f_i(x)$

$\le \sum_{j=1}^{s} |\psi_j(x)| \int |\psi_j(y)|\, |f_i(y)|\, d\mu(y) + f_i(x)$,

and the right hand side quantity is $\mu$-integrable. Hence by the Lebesgue
dominated convergence theorem,

$\int E|\hat f_i(x) - f_i(x)| \to 0$ as $n_i \to \infty$

and the conclusions follow from (4.2.6) and remark 4.3.

q.e.d.
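
A minimal sketch of the estimate (4.2.3) with the kernel (4.2.8) is given below. It assumes, purely for illustration, the sample space $[0,1]$ with Lebesgue measure and the cosine system $\psi_1(x) = 1$, $\psi_j(x) = \sqrt{2}\cos((j-1)\pi x)$, which is orthonormal there; the source does not specify a particular system, and the sample values are hypothetical.

```python
import math

def psi(j, x):
    """Assumed orthonormal cosine system on [0, 1] (j = 1, 2, ...)."""
    return 1.0 if j == 1 else math.sqrt(2.0) * math.cos((j - 1) * math.pi * x)

def g(x, y, s):
    """Kernel of (4.2.8): sum over j = 1..s of psi_j(x) * psi_j(y)."""
    return sum(psi(j, x) * psi(j, y) for j in range(1, s + 1))

def series_estimate(x, sample, s):
    """Estimate of (4.2.3): average of g(x, x_k) over the sample."""
    return sum(g(x, xk, s) for xk in sample) / len(sample)

sample = [0.10, 0.40, 0.45, 0.80]    # hypothetical observations from one population
s = 4

# Midpoint-rule integral of the estimate over [0, 1]; only psi_1 contributes.
m = 1000
total = sum(series_estimate((i + 0.5) / m, sample, s) for i in range(m)) / m
print(total)
```

The estimate integrates to one because every $\psi_j$ with $j \ge 2$ integrates to zero over $[0,1]$; it can nevertheless take negative values, so it need not be a probability density.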


We shall state the following theorem (without proof) concerning
the asymptotic optimality of nonparametric classification rules, as
proved by Van Ryzin [1965, 1966]. Let $\mathcal{X}$ be the Euclidean r-space
$R^r$ and $\mu$ be r-dimensional Lebesgue measure. We define $\hat f_i(x)$ by
(4.2.3), by choosing

(4.2.10)  $g_k(x,y) = \frac{1}{h_k^r} K\left(\frac{x - y}{h_k}\right)$

where $\{h_k\}$ is a sequence of positive numbers satisfying

(4.2.11)  $h_k \downarrow 0$ as $k \uparrow \infty$

and $K(y) = K(y_1,y_2,\ldots,y_r)$ is a bounded Borel function on Euclidean
r-space with

(4.2.12)  $\int K(y)\, dy = 1$,  $K(y) \ge 0$

(4.2.13)  $\|y\|^r K(y) \to 0$ as $\|y\| \to \infty$

(For a detailed discussion on this density estimation method,
see section 3.2.1 and Parzen [1962].)

Theorem 4.6: Let $f_i(x)$ be continuous a.e. with respect to Lebesgue
measure $\mu$. Then the rule $\hat D$ defined by (4.2.3), (4.2.4) and (4.2.10)
is BRC with $D^*$.

Proof: See Van Ryzin [1965, 1966].

q.e.d.
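
The plug-in rule (4.2.4) with kernel estimates of the form (4.2.10) can be sketched in the univariate case (r = 1) as follows. The standard normal kernel satisfies (4.2.12) and (4.2.13); the bandwidth choice $h_n = n^{-1/5}$, the normal populations and the function names are assumptions made for this example only.

```python
import math
import random

def gaussian_kernel(t):
    """Standard normal density: bounded, non-negative, integrates to one."""
    return math.exp(-0.5 * t * t) / math.sqrt(2.0 * math.pi)

def kernel_estimate(x, sample, h):
    """(4.2.3) with g_n(x, y) = K((x - y) / h) / h, for r = 1."""
    return sum(gaussian_kernel((x - xi) / h) for xi in sample) / (len(sample) * h)

def plug_in_rule(x, sample1, sample2, q):
    """Decide 1 if q * f1_hat(x) >= (1 - q) * f2_hat(x), otherwise decide 2."""
    h1 = len(sample1) ** -0.2        # assumed bandwidths, decreasing to 0 with n
    h2 = len(sample2) ** -0.2
    f1 = kernel_estimate(x, sample1, h1)
    f2 = kernel_estimate(x, sample2, h2)
    return 1 if q * f1 >= (1.0 - q) * f2 else 2

# Assumed illustration: pi_1 = N(0,1), pi_2 = N(3,1), q = 1/2.
rng = random.Random(2)
sample1 = [rng.gauss(0.0, 1.0) for _ in range(300)]
sample2 = [rng.gauss(3.0, 1.0) for _ in range(300)]
decisions = (plug_in_rule(-0.5, sample1, sample2, 0.5),
             plug_in_rule(3.5, sample1, sample2, 0.5))
print(decisions)
```

With these populations the Bayes rule (4.2.1) has its boundary at x = 1.5, and the plug-in rule reproduces its decisions at points well away from that boundary.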


Following Van Ryzin's notion of Bayes risk consistency
(definition 3.1), Glick [1972] proved the asymptotic optimality of the
density plug-in rule $\hat D$.

Theorem 4.7: Subject to the conditions of Theorem 4.2, the density
plug-in rule $\hat D$ is Bayes risk consistent (or strongly consistent); and $\hat r(\hat D)$
is a consistent (or strongly consistent) estimator of the optimum
probability $r^*$, i.e.,

$r(\hat D) \xrightarrow[\text{a.s.}]{p} r^*$  and

(4.2.14)  $\hat r(\hat D) \xrightarrow[\text{a.s.}]{p} r^*$.

Proof: Theorem 4.2 immediately implies

$\hat r(\hat D) - r(\hat D) \to 0$.

Moreover,

$|\hat r(\hat D) - r^*| = \left| \sup_{D \in \mathcal{D}^*} \hat r(D) - \sup_{D \in \mathcal{D}^*} r(D) \right|$

$\le \sup_{D \in \mathcal{D}^*} |\hat r(D) - r(D)|$,

so theorem 4.2 implies

$\hat r(\hat D) \to r^*$.

Further,

$|r(\hat D) - r^*| \le |r(\hat D) - \hat r(\hat D)| + |\hat r(\hat D) - r^*|$

$\to 0$  (by the first two convergences).

Thus,

$r(\hat D) \to r^*$.

q.e.d.


Example 4.3: In the previously considered example 4.1,

$\hat q_i \hat f_i(x_k) = \frac{n_{ik}}{n} \xrightarrow{\text{a.s.}} q_i f_i(x_k)$

(by the strong law of large numbers). Now,

$\int \hat f(x)\, d\mu(x) = \sum_k \hat f(x_k)$

$= \sum_k \sum_i \hat q_i \hat f_i(x_k)$

$= \frac{1}{n} \sum_k \sum_i n_{ik}$

$= 1$.

Hence theorems 4.2 and 4.7 apply, with convergence almost surely and in
quadratic mean. (Indeed, for k = 2 distributions on a finite sample
space, Glick [1973] has proved that $P[r(\hat D) = r^*] \to 1$, with
exponential convergence.)


§4.3  Consistency  of  Minimum  Distance  Nonparametric  Classification  Rule. 

In  section  3.3.2,  we  discussed  the  minimum  distance  nonpara¬ 
metric  classification  rule,  as  proposed  by  Das  Gupta  [1964].  We  give 
here  the  mathematical  proofs  of  the  assertions  made  in  that  section. 

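Lemma 4.2: For k = 2, the following relation holds: For i = 1,2

(4.3.1)  $r_{ii}(d) \ge f_d(n_1, \tfrac{\beta}{4}, F_1)\, f_d(n_2, \tfrac{\beta}{4}, F_2)\, f_d(n_0, \tfrac{\beta}{4}, F_0 = F_i)$

($i = 1,2$) where $d(F_1,F_2) \ge \beta > 0$ and $r_{ii}(d)$ and $f_d$ are defined
by (3.3.7) and (3.3.8) respectively.

Proof: We shall prove the result for i = 1, so that $F_0 = F_1$. The proof is
analogous for i = 2.

By the triangle inequality,

(4.3.2)  $d(\hat F_0, \hat F_1) \le d(\hat F_0, F_1) + d(\hat F_1, F_1)$,  and

(4.3.3)  $d(\hat F_0, \hat F_2) \ge d(\hat F_0, F_2) - d(\hat F_2, F_2) \ge d(F_1, F_2) - d(\hat F_0, F_1) - d(\hat F_2, F_2)$.

By (3.3.4),

(4.3.4)  $d(\hat F_0, \hat F_2) - d(\hat F_0, \hat F_1) = d_{02} - d_{01}$.

Combining (4.3.2), (4.3.3) and (4.3.4) we obtain

(4.3.5)  $d_{02} - d_{01} \ge d(F_1, F_2) - d(\hat F_1, F_1) - d(\hat F_2, F_2) - 2\, d(\hat F_0, F_1)$.

Since $d(F_1,F_2) \ge \beta$, the inequalities
$d(\hat F_0, F_1) < \tfrac{\beta}{4}$, $d(\hat F_1, F_1) < \tfrac{\beta}{4}$, $d(\hat F_2, F_2) < \tfrac{\beta}{4}$ give from (4.3.5)

$d_{02} - d_{01} \ge 0$.

Consequently,

$r_{11}(d) = P[d_{02} - d_{01} \ge 0 \mid F_0 = F_1]$

$\ge P[d(\hat F_1, F_1) < \tfrac{\beta}{4},\ d(\hat F_2, F_2) < \tfrac{\beta}{4},\ d(\hat F_0, F_1) < \tfrac{\beta}{4} \mid F_0 = F_1]$

$= f_d(n_1, \tfrac{\beta}{4}, F_1)\, f_d(n_2, \tfrac{\beta}{4}, F_2)\, f_d(n_0, \tfrac{\beta}{4}, F_0 = F_1)$

(by the independence of the three samples), which proves (4.3.1) for i = 1.

q.e.d.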


Lemma 4.3: For any $i$ ($i = 1,2,\ldots,k$)

(4.3.6)  $1 - r_{ii}(d) \le \sum_{j=1,\, j \ne i}^{k} [1 - B_{ij}(d)]$

where

(4.3.7)  $B_{ij}(d) = P[d_{0j} > d_{0i} \mid F_0 = F_i]$,  ($i,j = 1,2,\ldots,k$).

Proof: Let $E_{ij}$ be the event $d_{0i} \ge d_{0j}$. Then

$1 - r_{ii}(d) = P\left[\bigcup_{j \ne i} E_{ij} \mid F_0 = F_i\right]$

$\le \sum_{j=1,\, j \ne i}^{k} P[E_{ij} \mid F_0 = F_i]$

$= \sum_{j=1,\, j \ne i}^{k} [1 - B_{ij}(d)]$.

q.e.d.

A well-known theorem on the Kolmogorov distance (def. 3.5)
states that:

Theorem 4.8: The Kolmogorov distance is uniformly consistent in the
class of all univariate distribution functions.

Proof: See Das Gupta [1964].

q.e.d.
Theorem 4.9: If the distance function $d$ is consistent (uniform) then
the minimum distance classification rule $\phi^{(d)}$, defined by (3.3.4), is
consistent (uniform), i.e.

$r_{ii}(d) \to 1$  $\forall\, i$ ($i = 1,2,\ldots,k$) as $n_i \to \infty$

where $r_{ii}(d)$ is defined by (3.3.7).

Proof: Let $d(F_i,F_j) = \epsilon_{ij}$, $\epsilon_{ij} > 0$. Then, by lemma 4.2,

$B_{ij}(d) \ge f_d(n_i, \tfrac{\epsilon_{ij}}{4}, F_i)\, f_d(n_j, \tfrac{\epsilon_{ij}}{4}, F_j)\, f_d(n_0, \tfrac{\epsilon_{ij}}{4}, F_0 = F_i)$.

$d$ consistent $\Rightarrow$ each of $f_d(n_i, \tfrac{\epsilon_{ij}}{4}, F_i)$, $f_d(n_j, \tfrac{\epsilon_{ij}}{4}, F_j)$, $f_d(n_0, \tfrac{\epsilon_{ij}}{4}, F_0 = F_i)$
approaches 1 as $n_0, n_i, n_j \to \infty$, and hence $B_{ij}(d) \to 1$.
Consequently, the conclusions follow from lemma 4.3.

A similar argument holds for uniform consistency.

q.e.d.


Corollary  4.2:  The  minimum  distance  classification  rule  based  on  Kolmo¬ 
gorov  distance  (in  the  univariate  case)  is  uniformly  consistent. 

Proof :  Follows  immediately  from  theorem  4.8  and  theorem  4.9. 

q.e.d. 
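
The minimum distance rule with the Kolmogorov distance can be sketched numerically as follows. This is a minimal illustration and not part of the original study: the function names and the toy samples are assumptions made for the example, and the distance is taken between empirical distribution functions, $\sup_t |\hat F_0(t) - \hat F_i(t)|$, in the spirit of (3.3.4).

```python
from bisect import bisect_right


def ecdf(sample):
    """Return the empirical distribution function of a univariate sample."""
    xs = sorted(sample)
    n = len(xs)
    return lambda t: bisect_right(xs, t) / n


def kolmogorov_distance(a, b):
    """sup_t |F_a(t) - F_b(t)| for two empirical distribution functions.

    Both are right-continuous step functions, so the supremum is
    attained at one of the pooled sample points.
    """
    fa, fb = ecdf(a), ecdf(b)
    return max(abs(fa(t) - fb(t)) for t in set(a) | set(b))


def min_distance_classify(sample0, training):
    """Assign sample0 to the class whose training sample is closest."""
    dists = {label: kolmogorov_distance(sample0, s) for label, s in training.items()}
    return min(dists, key=dists.get), dists


# Hypothetical training samples from two populations and a sample to classify.
training = {1: [0.1, 0.4, 0.5, 0.9, 1.2], 2: [2.0, 2.3, 2.9, 3.1, 3.6]}
sample0 = [0.3, 0.7, 1.0]
label, dists = min_distance_classify(sample0, training)
```

Here the sample to be classified lies entirely below the second training sample, so its distance to population 2 is 1 and it is assigned to population 1.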


§4.4  Certain  Results  on  Best-Count  Discriminants. 

There is a direct parallel to the bias theorem 4.1 for best-count discriminants.

Theorem 4.10 (Bias): For any subcollection $\mathcal{D}^*$ of $\mathcal{D}$ and any sample-based best-count discriminant $\tilde D \in \mathcal{D}^*$, $\tilde r(\tilde D)$ has expected value greater than or equal to the restricted optimum probability, which in turn is greater than or equal to $r(\tilde D)$, i.e.

(4.4.1)  $E(\tilde r(\tilde D)) \ge r_{\mathcal{D}^*} \ge r(\tilde D)$.


Proof: Similar to that of theorem 4.1.

q.e.d.


Theorem 4.11 (uniform convergence): As the sample size $n \to \infty$, the counting function $\tilde r$ converges to the actual probability of correct classification r(D), uniformly over all discriminants D in the subcollection $\mathcal{D}^*$, i.e.

(4.4.2)  $\sup_{D \in \mathcal{D}^*} |\tilde r(D) - r(D)| \to 0$ in quadratic mean (q.m.),

provided that $F_1, \dots, F_k$ are absolutely continuous with respect to the Lebesgue measure $\mu$.

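Proof: Using (3.3.1) and (2.2.12) we have

$|\tilde r(D) - r(D)| = \left| \sum_{i=1}^{k} \hat q_i \int_{D_i} d\hat F_i(x) - \sum_{i=1}^{k} q_i \int_{D_i} dF_i(x) \right|$

$\le \sum_{i=1}^{k} \left\{ \hat q_i \left| \int_{D_i} d\hat F_i(x) - \int_{D_i} dF_i(x) \right| + |\hat q_i - q_i| \int_{D_i} dF_i(x) \right\}$

$\le \sum_{i=1}^{k} \left\{ \left| \int_{D_i} d\hat F_i(x) - \int_{D_i} dF_i(x) \right| + |\hat q_i - q_i| \right\}$.

If $\mathcal{H}(u)$ is the collection of all sets which are intersections of at most u half-spaces then either $D_i \in \mathcal{H}(u)$ or $X - D_i \in \mathcal{H}(u)$ and

$\left| \int_{D_i} d\hat F_i(x) - \int_{D_i} dF_i(x) \right| \le \sup_{S \in \mathcal{H}(u)} \left| \int_S d\hat F_i(x) - \int_S dF_i(x) \right|$.

Since this bound does not depend on the particular discriminant D, it also bounds $\sup_{D \in \mathcal{D}^*} |\tilde r(D) - r(D)|$ and thus,

$\sup_{D \in \mathcal{D}^*} |\tilde r(D) - r(D)| \le \sum_{i=1}^{k} \left\{ \sup_{S \in \mathcal{H}(u)} \left| \int_S d\hat F_i(x) - \int_S dF_i(x) \right| + |\hat q_i - q_i| \right\}$

($|\hat q_i - q_i| \to 0$ almost surely by (2.3.2)). So, to conclude the proof, one needs to prove the convergence

$\sup_{S \in \mathcal{H}(u)} \left| \int_S d\hat F_i(x) - \int_S dF_i(x) \right| \to 0$ almost surely.

Any asymptotic result of the above form is called a Glivenko-Cantelli convergence of sample measures and has been established by Theorem 2 of Suzuki [1966]. Since for all $D \in \mathcal{D}^*$,

$|\tilde r(D) - r(D)| \le \tilde r(D) + r(D) \le 2$,

the convergence in quadratic mean follows from almost sure convergence.

q.e.d.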

Remark 4.4: Theorem 4.10 asserts that the best-count discriminant $\tilde D$ is asymptotically $\mathcal{D}^*$-optimal (optimal in the unrestricted sense if $\mathcal{D}^*$ contains any optimal discriminant).

Corollary 4.3: Subject to the conditions of theorem 4.7 the best-count discriminant $\tilde D$ is Bayes risk strongly consistent, i.e.

$\tilde r(\tilde D) \to \sup_{D \in \mathcal{D}^*} r(D)$ with probability one, and

(4.4.3)  $r(\tilde D) \to \sup_{D \in \mathcal{D}^*} r(D)$ with probability one.


Proof: Similar to the proof of theorem 4.7, restricting the classification rules D to the subcollection $\mathcal{D}^*$ of $\mathcal{D}$.

q.e.d.

Here is an example (Glick [1975]) to show that in the case of classification into one of two multivariate normal distributions with common known identity covariance matrix and with estimated mean vectors, even with simple loss structure and equal prior probabilities, the Fisher-Anderson plug-in linear discriminant is not necessarily a best-count rule for the collection of all linear classifiers, i.e. $\hat D \ne \tilde D$ in general.


Example 4.4:

[Figure I: scatter of the sample points, with the Fisher-Anderson classifier drawn as a solid line and a best-count linear discriminant drawn as a diagonal line.]

Consider a hypothetical sample of n = 6 correctly classified bivariate observations from a mixed population $\pi$. Individuals from $\pi_1$ are denoted by X and those from $\pi_2$ are denoted by O. The solid line (perpendicular bisector of the line segment between sample means) is the Fisher-Anderson classifier ($\hat D$), and this partition misclassifies one of the three observations from each population. But the diagonal line in Figure I partitions the plane into two disjoint half-spaces and corresponds to a best-count linear discriminant ($\tilde D$), which classifies all of the sample points correctly.
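
The phenomenon in Example 4.4 can be reproduced with a small numerical sketch. The six points below are hypothetical (they are not the points of Figure I) and the helper names are assumptions made for the illustration; the code simply counts correct classifications for the perpendicular-bisector rule and for a diagonal line.

```python
def mean(points):
    """Componentwise mean of a list of 2-D points."""
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)


def count_correct(score, xs, os_):
    """Counting estimate: an observation is assigned to pi_1 when score > 0."""
    return sum(score(p) > 0 for p in xs) + sum(score(p) < 0 for p in os_)


# Hypothetical sample: three observations from pi_1 (X) and three from pi_2 (O).
xs = [(-3, -2), (-2, -1), (2, 3)]
os_ = [(3, 2), (2, 1), (-2, -3)]

m1, m2 = mean(xs), mean(os_)
mid = ((m1[0] + m2[0]) / 2, (m1[1] + m2[1]) / 2)
w = (m1[0] - m2[0], m1[1] - m2[1])


def bisector(p):
    """Plug-in rule with identity covariance: perpendicular bisector of the means."""
    return (p[0] - mid[0]) * w[0] + (p[1] - mid[1]) * w[1]


def diagonal(p):
    """A competing linear rule: the line y = x."""
    return p[1] - p[0]


bisector_correct = count_correct(bisector, xs, os_)
diagonal_correct = count_correct(diagonal, xs, os_)
```

With these points the bisector rule classifies four of the six observations correctly, whereas the diagonal line classifies all six, so the plug-in rule does not attain the best count among linear rules.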




CHAPTER  V 


General  Remarks 

In the preceding chapters, we discussed various classification procedures, parametric and nonparametric, and some mathematical results on these rules and the associated probabilities of correct classification. In this chapter, we make some general remarks on classification theory, which will be of some use to a statistician.

It was noted that the basic idea in arriving at different classification criteria is the same, namely that the rule minimizes the expected loss or, in particular, assuming a simple loss structure, the probability of misclassification, a natural criterion. After a discriminant or classification procedure has been established, it is of considerable interest to determine whether the discriminant is really useful. The method of studying such a question involves the use of the confusion matrix, defined by Massy [1965], which provides a method for summarizing the number of correct and incorrect classifications made by the procedure. One can also investigate the sensitivity of a procedure to deviations from the assumptions under which it was derived. As an example, we mention Chapter 3 of Lachenbruch [1975], which is concerned with the robustness properties of linear discriminant functions. (For details see Lachenbruch [1975].)
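
As a minimal sketch of the confusion matrix idea, the following tabulates true against assigned populations and reports the proportion correctly classified. The labels are hypothetical and the function names are assumptions made for the illustration, not taken from Massy [1965].

```python
from collections import Counter


def confusion_matrix(true_labels, assigned_labels, classes):
    """Rows index the true population, columns the assigned population."""
    counts = Counter(zip(true_labels, assigned_labels))
    return [[counts[(t, a)] for a in classes] for t in classes]


def proportion_correct(matrix):
    """Share of individuals on the diagonal, i.e. correctly classified."""
    total = sum(sum(row) for row in matrix)
    return sum(matrix[i][i] for i in range(len(matrix))) / total


# Hypothetical labels for ten individuals from two populations.
true_labels = [1, 1, 1, 1, 1, 2, 2, 2, 2, 2]
assigned_labels = [1, 1, 1, 2, 1, 2, 2, 1, 2, 2]
matrix = confusion_matrix(true_labels, assigned_labels, classes=[1, 2])
```

The off-diagonal entries count the two kinds of misclassification separately, which is what makes the table more informative than a single error rate.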

In Chapters II and III, we did not dwell much on classification into one of several populations. There are two reasons for this. Firstly, the essence of the problem is often contained in the two population case, and secondly, the multiple population case may involve more complex sampling situations. Lachenbruch [1973] has considered two parametric methods for solving such classification problems and studied the relative performance of these two methods using the estimated proportion of correct classification. Kanazawa [1974] developed a nonparametric classification rule based on the Wilcoxon statistic for the several population case, proving its consistency.

If the number p of variates (dimensions) of the problem is too large, the data are subjected to Factor analysis, a technique that attempts to account for the correlation pattern in a set of observable random variables in terms of a minimal number of unobservable random variables called Factors. These fundamental factors and their linear combinations are used to explain the observed data. Evidently, some information is lost in this way. Considering the analogy between discriminant analysis and regression analysis, it can be said that, unlike regression coefficients, discriminant coefficients are not unique; only their ratios are.

In most of the classification procedures, it has been assumed that X, the vector of measurements, is readily observable. However, at times it may not be possible to observe every component of X on each unit that is sampled. This gives rise to what is called "incomplete" data. It is worth mentioning that in such cases one may consider a general stochastic process instead of a finite dimensional vector X.

Other interesting topics on classification found in the literature are the following:
(i) Sequential Discrimination.

Let $X_1, X_2, \dots$ be i.i.d. random variables. Observing the X's sequentially and knowing that their distribution is one of countably many different probabilities, the general problem of sequential discrimination is how to decide, within an arbitrary error level, which one it is. Sometimes, when the distance between the populations is fairly small, the discriminatory power of the observed variables is insufficient for satisfactory assignment to $\pi_1$ or $\pi_2$. Several sequential approaches have been proposed to avoid this problem. Suppose that we wish to avoid more than proportion $\varepsilon_1$ of errors in $\pi_1$ and $\varepsilon_2$ in $\pi_2$. If it is possible to obtain independent observations on the individual to be classified, then Lachenbruch [1975] suggests the use of the sequential probability ratio test to assign to $\pi_1$ or $\pi_2$.

The variable U(X) of (2.2.8) is normally distributed with mean $\Delta^2/2$ in $\pi_1$ and $-\Delta^2/2$ in $\pi_2$ and variance $\Delta^2$, where

$\Delta^2 = (\mu^{(1)} - \mu^{(2)})' \Sigma^{-1} (\mu^{(1)} - \mu^{(2)})$

is the Mahalanobis generalized squared distance (see section 2.2.3). The assignment rule may be described as follows: A sequential likelihood ratio test of the hypothesis $H_1 : X \in \pi_1$ versus $H_2 : X \in \pi_2$ is performed. Let

$A = \frac{1 - \varepsilon_2}{\varepsilon_1}$,  $B = \frac{\varepsilon_2}{1 - \varepsilon_1}$.

Observe $X_1$ and calculate

(5.1)  $\lambda_1 = \frac{f_2(U(X_1); \Delta^2)}{f_1(U(X_1); \Delta^2)} = e^{-U(X_1)}$.

Then, if

$\lambda_1 < B$, assign to $\pi_1$, and if
$\lambda_1 > A$, assign to $\pi_2$.

Otherwise, take a second observation and calculate

$\lambda_2 = \prod_{i=1}^{2} \frac{f_2(U(X_i); \Delta^2)}{f_1(U(X_i); \Delta^2)}$

and then compare $\lambda_2$ to A and B. This process of taking an observation and calculating $\lambda$ is continued until $\lambda_n$ is less than B or greater than A. In general, we have

$\lambda_n = e^{-\sum U(X_i)} = e^{-n \bar U(X)}$

and consequently, the rule is: Assign to $\pi_1$ if after n observations

$\bar U(X) > -\frac{1}{n} \ln B$,

to $\pi_2$ if

$\bar U(X) < -\frac{1}{n} \ln A$.

It is clear that the method described above does not involve prior probabilities $q_1$ and $q_2$. This is because we are restricting the individual probability of misclassification. Kendall [1966] suggested a sequential method based on order statistics. The usage of sequential discriminants is not widespread and there is no systematic work on sequential rules. (For more references on this topic see Das Gupta [1973].)
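
The sequential rule can be sketched in a few lines, working on the log scale so that $\ln \lambda_n = -\sum U(X_i)$. This is an illustration under stated assumptions: the boundaries are taken to be Wald's usual approximations $A = (1-\varepsilon_2)/\varepsilon_1$ and $B = \varepsilon_2/(1-\varepsilon_1)$, the function name `sprt_assign` is hypothetical, and the U values in the usage note are made up.

```python
import math


def sprt_assign(u_values, eps1, eps2):
    """Sequential rule based on lambda_n = exp(-sum U(X_i)).

    eps1, eps2 are the tolerated error proportions in pi_1 and pi_2
    (boundaries assumed to be Wald's approximations).
    Returns (population, n) once a boundary is crossed, or (None, n)
    if the observations run out first.
    """
    log_a = math.log((1 - eps2) / eps1)   # upper boundary, ln A
    log_b = math.log(eps2 / (1 - eps1))   # lower boundary, ln B
    log_lambda = 0.0
    n = 0
    for n, u in enumerate(u_values, start=1):
        log_lambda -= u                   # ln lambda_n = -sum U(X_i)
        if log_lambda < log_b:
            return 1, n                   # lambda_n < B: assign to pi_1
        if log_lambda > log_a:
            return 2, n                   # lambda_n > A: assign to pi_2
    return None, n
```

For example, with $\varepsilon_1 = \varepsilon_2 = 0.05$ the boundaries are $\ln A = \ln 19$ and $\ln B = -\ln 19$, and the hypothetical sequence of U values 1.0, 1.5, 1.0 leads to assignment to $\pi_1$ at the third observation.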
(ii) Logistic Discrimination.

Logistic discrimination was introduced by Cox [1966] for discriminating between two populations when some or all of the observations are qualitative. It is found mostly in medical diagnosis based on symptoms and signs, and in epidemiology, in investigating factors related to diseases with low incidence. For more details see Cacoullos [1973, pp. 1-14].

(iii)  Discrimination  between  Stochastic  Processes. 

The  papers  dealing  with  this  problem  of  discrimination  are 
concerned  mainly  with  finding  conditions  under  which  two  or  more  processes 
(i.e.  the  induced  measures)  are  equivalent  or  non-singular.  For  details 
see  Das  Gupta  [1973] . 

(iv)  Constrained  Discrimination. 

In Chapter II, we studied the optimal Bayes rule, which minimizes the expected loss or the probabilities of misclassification. However, sometimes the probabilities of misclassification are so large that the procedure is of little practical use. One alternative is to assign costs to the various types of error, which is often difficult or impossible. A second alternative is to decide the probabilities of misclassification within each group that can be tolerated and obtain a rule that satisfies these constraints. These constitute what is called "constrained discrimination".

As is evident, the classification procedures are all strikingly different from one another. Comparisons of different rules in similar situations should be interesting. In particular, the best-count rule and the Fisher-Anderson linear discriminant rule might be compared for both normal and non-normal data. Counting estimates of classification probabilities (the R-method of Lachenbruch and Mickey [1968]) have been compared to density plug-in estimates (their D-method) in the case of the estimated Fisher-Anderson rule. Lachenbruch and Mickey [1968] conclude that both the estimates are similarly biased for multivariate normal data. The most appealing technique is Anderson's modification of Fisher's linear discriminant, namely, the plug-in linear discriminant, yet he says that it only "seems intuitively reasonable".

Many computer programs are available to perform linear discriminant analyses. The most widely used package is BMD [Dixon, 1974], which has three discriminant analysis programs, BMD 04M, BMD 05M and BMD 07M. BMD 04M computes a discriminant function for two groups using specified subsets of variables. The output includes group means, covariance matrix, coefficients of the discriminant function and the Mahalanobis $D^2$. BMD 05M performs a multiple-group discriminant analysis for up to five groups. Output includes means, covariance matrix, Mahalanobis $D^2$, coefficients of discriminant functions for each group and a classification matrix. It is assumed that a priori probabilities are the same for each group, which can be a rather serious limitation. BMD 07M performs a stepwise discriminant analysis on up to 80 groups. The variable to enter or to be deleted is selected on the basis of one of three criteria at the user's option. Output includes the population means and pooled covariance matrix, classification matrix at specified steps, and posterior probabilities of coming from each population, among others. This program also has the option of specifying prior probabilities.

It is not difficult to extend the classification framework of this study to cases in which there are k classes and a different finite number L of decision options. Other applications of this generalized framework are suggested by Marshall and Olkin [1968]. Finally, it should be pointed out that the classification problem can be arrived at starting from the framework of Cluster Analysis, whose operational objective is to discover a category structure which fits the observations. In this case little or nothing is known about the category structure, and all that is available is a collection of observations. On the other hand, in the case of the classification problem, the operational objective is to classify new individuals, i.e. given the category structure, the classification problem amounts to recognising the new individuals as members of one category or another. Cluster Analysis has been employed as a tool in scientific inquiry, a tool of discovery. Biologists give it the name "numerical taxonomy" while the engineers call it "learning without teacher". For a detailed discussion on Cluster Analysis, see Anderberg [1973].
APPENDIX I


The following data has been taken from the '1975 world population sheet' published by the Population Reference Bureau, Inc. (Washington). The data is based primarily on unpublished United Nations (UN) figures. The data sheet lists all countries with a population larger than 200,000. The variables considered are:

1. Birth rate (= annual number of births per 1,000 population),

2. Death rate (= annual number of deaths per 1,000 population),

3. Life expectancy at birth (years),

4. Per capita gross national product (US$).

The data for variables 1, 2 and 3 come from unpublished materials of the population division of the UN. Birth rates, death rates and life expectancy at birth refer to the average of the 1970-75 period. Per capita gross national product is taken from the International Bank for Reconstruction and Development, 1971 or 1972 data.

The two populations $\pi_1$ and $\pi_2$ consist of developed and underdeveloped countries (or regions) respectively. The term 'developed' corresponds to low birth and death rates, high life expectancy and reasonably high per capita gross national product. The problem of classification amounts to classifying other (doubtful) countries as developed or underdeveloped with respect to these variables. We consider two samples of sizes 30 and 40 respectively from the two populations $\pi_1$ and $\pi_2$.



Raw data for the samples from the two populations.

Sample 1 ($n_1$ = 30).

Country             (1)    (2)    (3)      (4)
AUSTRALIA          21.0    8.3   72.0   2980.0
AUSTRIA            14.7   12.2   71.0   2410.0
BELGIUM            14.8   11.2   73.0   3210.0
BULGARIA           16.2    9.2   72.0    820.0
CANADA             18.6    7.7   72.0   4440.0
CZECHOSLOVAKIA     17.0   11.2   69.0   2120.0
DENMARK            14.0   10.1   74.0   3670.0
FINLAND            13.2    9.3   70.0   2810.0
FRANCE             17.0   10.6   73.0   3620.0
GERMANY            12.0   12.1   71.0   3390.0
GREECE             15.4    9.4   72.0   1460.0
HUNGARY            15.3   11.5   70.0   1200.0
ICELAND            19.3    7.7   74.0   2800.0
IRELAND            22.1   10.4   72.0   1580.0
ISRAEL             26.5    6.7   71.0   2610.0
ITALY              16.0    9.8   72.0   1960.0
JAPAN              19.2    6.6   73.0   2320.0
LUXEMBOURG         13.5   11.7   71.0   3190.0
NETHERLANDS        16.8    8.7   74.0   2840.0
NEW ZEALAND        22.3    8.3   72.0   2560.0
NORWAY             16.7   10.1   74.0   3340.0
POLAND             16.8    8.6   70.0   1350.0
SINGAPORE          21.2    5.2   70.0   1300.0
SPAIN              19.5    8.3   72.0   1210.0
SWEDEN             14.2   10.5   73.0   4480.0
SWITZERLAND        14.7   10.0   72.0   3940.0
UNITED KINGDOM     16.1   11.7   72.0   2600.0
UNITED STATES      16.2    9.4   71.0   5590.0
USSR               17.8    7.9   70.0   1400.0
YUGOSLAVIA         18.2    9.2   68.0    810.0

Sample 2 ($n_2$ = 40).

Country             (1)    (2)    (3)      (4)
ALGERIA            48.7   15.4   53.0    430.0
ANGOLA             47.3   24.5   38.0    390.0
BAHRAIN            49.6   18.7   47.0    640.0
BANGLADESH         49.5   28.1   36.0     70.0
BHUTAN             43.6   20.5   44.0     80.0
BRAZIL             37.1    8.8   61.0    530.0
BURMA              39.5   15.8   50.0     90.0
CHILE              27.9    9.2   63.0    800.0
CHINA              26.9   10.3   62.0    130.0
CONGO              45.1   20.8   44.0    290.0
CUBA               29.1    6.6   70.0    510.0
EGYPT              37.8   14.0   52.0    240.0
ETHIOPIA           49.4   25.8   38.0     80.0
FIJI               25.0    4.3   70.0    500.0
HAITI              35.8   16.5   50.0    130.0
INDIA              39.9   15.7   50.0    110.0
INDONESIA          42.9   16.9   48.0     90.0
IRAN               45.3   15.6   51.0    490.0
IRAQ               48.1   14.6   53.0    370.0
JAMAICA            33.2    7.1   70.0    810.0
JORDAN             47.6   14.7   53.0    270.0
KENYA              48.7   16.0   50.0    170.0
LEBANON            39.8    9.9   63.0    700.0
MALAYSIA           38.7    9.9   59.0    430.0
MAURITIUS          24.4    6.8   66.0    300.0
MEXICO             42.0    8.6   63.0    740.0
MONGOLIA           38.8    9.4   61.0    380.0
NEPAL              42.9   20.3   44.0     80.0
NIGERIA            49.3   22.7   41.0    130.0
PAKISTAN           47.4   16.5   50.0    130.0
PERU               41.0   11.9   56.0    520.0
PHILIPPINES        43.8   10.5   58.0    220.0
RHODESIA           47.9   14.4   52.0    340.0
SOUTH AFRICA       42.9   15.5   52.0    850.0
SRI LANKA          28.6    6.4   68.0    110.0
SYRIA              45.4   15.4   54.0    310.0
TANZANIA           50.2   20.1   44.0    120.0
THAILAND           43.4   10.8   58.0    220.0
TURKEY             39.4   12.5   57.0    370.0
UGANDA             45.2   15.9   50.0    150.0

Data of the countries to be classified.

     Country        (1)    (2)    (3)      (4)
 1.  Albania       33.4    6.5   69.0    480.0
 2.  Argentina     21.8    8.8   68.0   1290.0
 3.  Barbados      21.6    8.9   69.0    930.0
 4.  Cyprus        22.2    6.8   71.0   1180.0
 5.  Hong Kong     19.4    5.5   70.0    980.0
 6.  Kuwait        47.1    5.3   67.0   4090.0
 7.  Puerto Rico   22.6    6.8   72.0   2050.0
 8.  Romania       19.3   10.3   67.0    740.0
 9.  Uruguay       20.4    9.3   70.0    760.0
10.  Venezuela     36.1    7.1   65.0   1240.0
(I) Parametric Classification.

Let the populations be normal. The sample means (see table 1), inverse of the estimated covariance matrix, discriminant function coefficients and the Mahalanobis $D^2$ between the two populations are computed with the help of the computer program BMD 04M (Dixon [1974]).


Table 1

Variable       Mean 1       Mean 2     Difference          Sum
1            17.20995      41.2274      -24.0174      58.43735
2             9.44665     14.43494      -4.98829      23.88159
3            71.66666     53.72499      17.94167     125.39165
4          2600.33325    333.00000    2267.33325    2933.33325

Inverse Matrix of the Estimated Covariance Matrix:

0.00095    0.00076    0.00106    0.0000
0.00076    0.00784    0.00527    0.0000
0.00106    0.00527    0.00420    0.0000
0.0000     0.0000     0.0000     0.0000

Discriminant function coefficients:

-0.00842    0.02610    0.01524    0.00003

Mahalanobis $D^2$ = 28.06131.
(a) Classification using Anderson's rule (2.3.1(i)):

By (2.3.4) classify X into $\pi_1$ or $\pi_2$ according as

$V(X) = X' S^{-1} (\bar x^{(1)} - \bar x^{(2)}) - \frac{1}{2} (\bar x^{(1)} + \bar x^{(2)})' S^{-1} (\bar x^{(1)} - \bar x^{(2)}) \gtrless 0$

(assuming equal prior probabilities and equal losses for misclassifications).

1.  ALBANIA: V(X) = -0.0825 < 0
Therefore Albania is assigned to $\pi_2$.

2.  ARGENTINA: V(X) = 0.0647 > 0
Hence Argentina belongs to $\pi_1$.

3.  BARBADOS: V(X) = 0.0936 > 0
Hence Barbados is classified as developed.

4.  CYPRUS: V(X) = 0.0581 > 0
Hence Cyprus is a developed country.

5.  HONG KONG: V(X) = 0.0074 > 0
Hence Hong Kong is assigned to population $\pi_1$.

6.  KUWAIT: V(X) = -0.281 < 0
Hence Kuwait is underdeveloped.

7.  PUERTO RICO: V(X) = 0.0787 > 0
Hence, assigned to population $\pi_1$.

8.  ROMANIA: V(X) = 0.1159 > 0
Thus, Romania belongs to $\pi_1$.

9.  URUGUAY: V(X) = 0.1411 > 0
Hence Uruguay is developed.

10. VENEZUELA: V(X) = -0.1779 < 0
Hence, assigned to population $\pi_2$.
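
The sign of V(X) can be checked roughly from the printed quantities alone. The sketch below is an illustration, not the BMD 04M computation: it uses the discriminant function coefficients and the "Sum" column of Table 1 exactly as printed, so the coefficients are rounded to five decimals and the resulting scores will not reproduce the V(X) values above digit for digit; only the signs, and hence the assignments, are compared. The function name `v_score` is an assumption.

```python
# Values copied from Table 1 and the printed discriminant function coefficients
# (rounded as printed, so scores are approximate).
coefficients = [-0.00842, 0.02610, 0.01524, 0.00003]
mean_sum = [58.43735, 23.88159, 125.39165, 2933.33325]   # column "Sum"

countries = {
    "Albania": [33.4, 6.5, 69.0, 480.0],
    "Argentina": [21.8, 8.8, 68.0, 1290.0],
    "Barbados": [21.6, 8.9, 69.0, 930.0],
    "Cyprus": [22.2, 6.8, 71.0, 1180.0],
    "Hong Kong": [19.4, 5.5, 70.0, 980.0],
    "Kuwait": [47.1, 5.3, 67.0, 4090.0],
    "Puerto Rico": [22.6, 6.8, 72.0, 2050.0],
    "Romania": [19.3, 10.3, 67.0, 740.0],
    "Uruguay": [20.4, 9.3, 70.0, 760.0],
    "Venezuela": [36.1, 7.1, 65.0, 1240.0],
}


def v_score(x):
    """V(X) = b'X - (1/2) b'(mean1 + mean2), with b the printed coefficients."""
    return sum(b * (xi - s / 2) for b, xi, s in zip(coefficients, x, mean_sum))


# Population 1 (developed) when the score is positive, otherwise population 2.
assigned = {name: (1 if v_score(x) > 0 else 2) for name, x in countries.items()}
```

Even with the rounded coefficients, the signs agree with the list above: Albania, Kuwait and Venezuela fall in $\pi_2$ and the remaining seven countries in $\pi_1$.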


(b) Classification using Mahalanobis $D^2$ (2.3.1(ii)):

By (2.3.7) assign to $\pi_1$ or $\pi_2$ according as

$(X - \bar x^{(1)})' S^{-1} (X - \bar x^{(1)}) \lessgtr (X - \bar x^{(2)})' S^{-1} (X - \bar x^{(2)})$

where the l.h.s. denotes the distance of X from the 1st sample and the r.h.s. denotes the distance of X from the 2nd sample. Let these distances be denoted by $D_1$ and $D_2$ respectively.

1.  ALBANIA: $D_1$ = 0.2657, $D_2$ = 0.0952.
Thus, Albania belongs to $\pi_2$.

2.  ARGENTINA: $D_1$ = 0.0645, $D_2$ = 0.1692.
Hence, assigned to population $\pi_1$.

3.  BARBADOS: $D_1$ = 0.0374, $D_2$ = 0.2245.
Hence, Barbados is developed.

4.  CYPRUS: $D_1$ = 0.0719, $D_2$ = 0.1882.
Hence, belongs to $\pi_1$.

5.  HONG KONG: $D_1$ = 0.1868, $D_2$ = 0.2016.
Hence, Hong Kong is developed.

6.  KUWAIT: $D_1$ = 0.7949, $D_2$ = 0.2327.
Thus, Kuwait is underdeveloped.

7.  PUERTO RICO: $D_1$ = 0.0558, $D_2$ = 0.2132.
Therefore, is a member of $\pi_1$.

8.  ROMANIA: $D_1$ = 0.0414, $D_2$ = 0.2731.
Hence Romania is classified into $\pi_1$.

9.  URUGUAY: $D_1$ = 0.0121, $D_2$ = 0.337.
Hence, is assigned to $\pi_1$.

10. VENEZUELA: $D_1$ = 0.3994, $D_2$ = 0.0436.
Hence, Venezuela is underdeveloped.
(II) Nonparametric Classification by Nearest Neighbor Rule:

First the data was standardized with respect to the mean and standard deviation of each variable (see table 2). Euclidean distance has been considered.


Table 2

Variable        Mean    Standard Deviation
1              30.93         13.36
2              12.3           5.03
3              61.41         11.28
4            1304.71       1374.1


By definition 3.1, an observation $x_n' \in \{X_1, \dots, X_n\}$ is nearest neighbor to X = x (observation to be classified) if

$\min_{1 \le i \le n} d(x_i, x) = d(x_n', x)$.

Since the computations are tedious, we give classifications of only 3 or 4 countries. Other classifications are similar.

(1) ALBANIA: Its nearest neighbor is 'Cuba', which belongs to $\pi_2$ (Dist. = 0.335). Hence Albania is classified into $\pi_2$.

(3) BARBADOS: Nearest neighbor to Yugoslavia (Dist. = 0.289) and hence 'Barbados' is developed.

(8) ROMANIA: Nearest neighbor to Yugoslavia (Dist. = 0.255) and hence belongs to $\pi_1$.

(10) VENEZUELA: Nearest neighbor to Jamaica (Dist. = 0.584) and hence is underdeveloped.

Remark: With 'KUWAIT' we get the same minimum distance from Switzerland and Jamaica, so we can arbitrarily assign to $\pi_2$. (Minimum distance = 2.638).
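
The standardized Euclidean distance behind these figures can be sketched as follows. The means and standard deviations are those of Table 2 and the rows are copied from the raw data; the function names are assumptions, and the reference set is deliberately restricted to the four neighbours named in the text, so it is not the full sample of 70 countries.

```python
import math

# Means and standard deviations copied from Table 2.
MEANS = [30.93, 12.3, 61.41, 1304.71]
SDS = [13.36, 5.03, 11.28, 1374.1]


def standardize(x):
    """Centre and scale each variable as in Table 2."""
    return [(xi - m) / s for xi, m, s in zip(x, MEANS, SDS)]


def distance(x, y):
    """Euclidean distance between two standardized observations."""
    zx, zy = standardize(x), standardize(y)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(zx, zy)))


def nearest_neighbor(x, reference):
    """Return (name, population, distance) of the closest reference row."""
    name, (pop, row) = min(reference.items(), key=lambda kv: distance(x, kv[1][1]))
    return name, pop, distance(x, row)


# Illustrative four-row subset only, NOT the full reference sample.
reference = {
    "Yugoslavia": (1, [18.2, 9.2, 68.0, 810.0]),
    "Switzerland": (1, [14.7, 10.0, 72.0, 3940.0]),
    "Cuba": (2, [29.1, 6.6, 70.0, 510.0]),
    "Jamaica": (2, [33.2, 7.1, 70.0, 810.0]),
}
albania = [33.4, 6.5, 69.0, 480.0]
barbados = [21.6, 8.9, 69.0, 930.0]
romania = [19.3, 10.3, 67.0, 740.0]
venezuela = [36.1, 7.1, 65.0, 1240.0]
```

With these inputs the pairwise distances Albania-Cuba, Barbados-Yugoslavia, Romania-Yugoslavia and Venezuela-Jamaica agree with the quoted values to about three decimals. Because the reference set here is only a subset, the neighbour returned by `nearest_neighbor` is illustrative; a faithful run would search all 70 sample rows.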

Thus, it is clear that all three rules considered give the same classification of the countries to be classified.


